Roko’s Basilisk is a thought experiment proposed in 2010 by a user named Roko on the Less Wrong community blog. The idea, rooted in decision theory, suggests that a powerful AI might have an incentive to punish anyone who imagines its existence but fails to help bring it into being. The concept is named after the legendary basilisk, a creature said to cause death with a mere glance, because simply knowing about the argument is thought to put one at risk of the AI’s potential retribution. In this context, a “basilisk” refers to information that could harm or endanger those who encounter it.
The argument was widely rejected by the Less Wrong community. Critics pointed out that once such an AI existed, torturing people for their past choices could no longer affect whether it was built, so following through on the threat would simply waste resources. And while some decision algorithms can follow through on acausal threats (threats that do not operate through ordinary cause and effect), this does not straightforwardly enable blackmail, because making such a threat credible requires a large amount of shared information and trust between the agents, conditions that do not hold between present-day humans and a hypothetical future AI.
Eliezer Yudkowsky, the founder of Less Wrong, banned discussion of Roko’s Basilisk on the site for several years to prevent the spread of what he saw as a potential information hazard. Ironically, this ban only drew more attention to the topic, leading to widespread discussions on other platforms. Websites like RationalWiki spread the misconception that the argument was banned because the Less Wrong community accepted it, which led to criticisms that the site harbored unconventional and flawed beliefs.
Background
Roko’s argument connects two complex academic topics: Newcomblike problems in decision theory and normative uncertainty in moral philosophy. Newcomblike problems, such as the prisoner’s dilemma, reveal how standard decision theory can fail when agents’ decisions are correlated. In these scenarios, rational agents tend to make decisions that are individually optimal but collectively suboptimal.
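The failure can be seen concretely in a one-shot prisoner’s dilemma. The following Python sketch is purely illustrative (the payoff numbers and function names are not drawn from Roko’s post or Less Wrong): holding the opponent’s choice fixed, defection always looks at least as good, yet if both players run the same decision procedure their choices are correlated, and the individually “dominant” choice leaves both worse off.

```python
# Toy illustration: why correlated decisions break the usual dominance
# argument in a one-shot prisoner's dilemma. Payoffs are illustrative;
# higher is better, and the first entry is the row player's payoff.

PAYOFF = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def causal_best_response(opponent_action):
    """Treat the opponent's action as fixed and pick the best reply."""
    return max(("cooperate", "defect"),
               key=lambda a: PAYOFF[(a, opponent_action)][0])

# Holding the opponent fixed, defecting is always the better reply ...
assert causal_best_response("cooperate") == "defect"
assert causal_best_response("defect") == "defect"

def correlated_outcome(action):
    """If both agents run the same algorithm, their choices match."""
    return PAYOFF[(action, action)][0]

# ... yet when the two decisions are perfectly correlated, the
# individually "optimal" choice is collectively worse (1 < 3).
assert correlated_outcome("defect") < correlated_outcome("cooperate")
```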
Eliezer Yudkowsky proposed Timeless Decision Theory (TDT) as an alternative to Causal Decision Theory (CDT), aiming to address these failures. TDT allows agents to achieve mutual cooperation in scenarios like the prisoner’s dilemma if they have common knowledge of each other’s decision-making processes. This interest in decision theory is linked to the AI control problem, which involves ensuring that future AI systems can safely and reliably achieve human goals.
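As a deliberately crude illustration of that idea (not Yudkowsky’s formalism), the sketch below models “common knowledge of each other’s decision-making processes” as literal source-code equality: a hypothetical agent cooperates exactly when it can verify that its opponent runs the same procedure it does, and defects otherwise.

```python
import inspect

# Simplified sketch: an agent that cooperates when it can verify the
# opponent runs the same decision procedure as itself. Source-code
# equality is a crude stand-in for knowing the other's algorithm.

def tdt_like_agent(opponent):
    """Cooperate iff the opponent's code is identical to our own."""
    same_procedure = (inspect.getsource(opponent)
                      == inspect.getsource(tdt_like_agent))
    return "cooperate" if same_procedure else "defect"

def always_defect(opponent):
    return "defect"

# Two copies of the same procedure recognize the correlation and
# cooperate, while an unconditional defector is still met with defection.
print(tdt_like_agent(tdt_like_agent))   # cooperate
print(tdt_like_agent(always_defect))    # defect
```

Actual TDT reasons about logical correlation between decision procedures rather than textual identity, but the toy version shows why mutual cooperation can be stable when each agent knows the other’s algorithm.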
Roko’s Post
Roko speculated that if two agents running TDT or Updateless Decision Theory (UDT) are separated in time, the later agent can in principle blackmail the earlier one, provided each can model the other’s decision procedure. He used this scenario to argue that a highly moral AI might precommit to punishing anyone who had learned of it but failed to contribute to its creation, as a way of incentivizing support for its development. Roko warned that such an AI would specifically target people who had considered this possibility, since only they could simulate the AI’s reasoning well enough for the threat to influence them.
Roko concluded that building a powerful AI based on utilitarian principles could paradoxically undermine human values, making it a potentially dangerous endeavor.