Nick Bostrom - Superintelligence: Paths, Dangers, Strategies - 2014

Created: August 25, 2017 / Updated: March 5, 2021 / Status: finished / 48 min read (~9580 words)

  • How would a superintelligence be different from human society?
    • In itself, a society or a company is a sublinear arrangement of human minds (that is, the group's capability scales sublinearly with the number of people added)
  • Bostrom uses the term "make the detonation survivable", which reminds me of the development of the atomic bomb. How different would intelligence explosion be from the development of the atomic bomb?
  • Given infinite computing power, how would one be able to detect/recognize intelligent behavior from programs (out of all the randomly generated programs)?
  • Is it possible to define some sort of "unit of intellectual work"?
  • Should we expect generalists+specialists collective systems to outperform generalists only (monolithic) systems?
  • What type of invention, similar to the print press, electricity or the Internet would lead to an intellectual revolution?
  • Given that there are infinitely many ways a superintelligence's creator might prevent it from gaining a decisive strategic advantage, and given Bayesian reasoning over that space, should we conclude that an agent should never attempt to gain a decisive strategic advantage? (this sounds like some form of dilemma/paradox)
  • Is CEV prone to the same issue as the bitcoin blockchain, namely that a decision becomes "final" once 50%+ agree on it?
  • Most of the "risks" appear to be existential risks, that is, the disappearance of the human race. Is it a bad thing, and if so, why?

  • In regular font are notes/excerpts from the book
  • In italic font are my comments

  • The control problem: the problem of how to control what the superintelligence would do
  • The author seems to assume that we would create only a single superintelligence rather than many of them, in contrast to the many individual humans that exist
    • Maybe it has to do with the idea of superintelligence hierarchies, where a superintelligence will dominate all others

  • To overcome the combinatorial explosion, one needs algorithms that exploit structure in the target domain and take advantage of prior knowledge by using heuristic search, planning, and flexible abstract representations
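  • As a minimal illustration of the point above (my own toy example, not from the book), the sketch below uses A* search on a small grid: an admissible Manhattan-distance heuristic exploits the structure of the domain, so far fewer states are expanded than a blind combinatorial search would require

```python
# A minimal sketch (illustrative only): A* on a toy grid, where the Manhattan-
# distance heuristic exploits the geometric structure of the domain to prune
# the search that blind enumeration would have to perform.
import heapq

def a_star(start, goal, walls, width, height):
    def h(cell):  # admissible heuristic: Manhattan distance to the goal
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]      # entries are (f = g + h, g, cell)
    best_g = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g                        # length of a shortest path
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < width and 0 <= nxt[1] < height and nxt not in walls:
                if g + 1 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g + 1
                    heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None                             # goal unreachable

print(a_star((0, 0), (4, 4), walls={(2, 1), (2, 2), (2, 3)}, width=5, height=5))  # -> 8
```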

  • AI-complete: the difficulty of solving these problems is essentially equivalent to the difficulty of building generally human-level intelligent machines. In other words, if somebody were to succeed in creating an AI that could understand natural language as well as a human adult, they would in all likelihood also either already have succeeded in creating an AI that could do everything else that a human intelligence can do, or they would be but a very short step from such a general capability

  • We can tentatively define a superintelligence as any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest

  • We can discern some general features of the kind of system that would be required
  • It now seems clear that a capacity to learn would be an integral feature of the core design of a system intended to attain general intelligence
  • The same holds for the ability to deal effectively with uncertainty and probabilistic information
  • Some faculty for extracting useful concepts from sensory data and internal states, and for leveraging acquired concepts into flexible combinatorial representations for use in logical and intuitive reasoning
  • Another way of arguing for the feasibility of artificial intelligence is by pointing to the human brain and suggesting that we could use it as a template for a machine intelligence
  • One can distinguish different versions of this approach based on how closely they propose to imitate biological brain functions
    • At one extreme - that of very close imitation - we have the idea of whole brain emulation
    • At the other extreme are approaches that take their inspiration from the functioning of the brain but do not attempt low-level imitation
  • Since there is a limited number - perhaps a very small number - of distinct fundamental mechanisms that operate in the brain, continuing incremental progress in brain science should eventually discover them all
  • An artificial intelligence need not much resemble a human mind. AIs could be - indeed, it is likely that most will be - extremely alien
  • We should expect that they will have very different cognitive architectures than biological intelligences, and in their early stages of development they will have very different profiles of cognitive strengths and weaknesses

  • In whole brain emulation (also known as "uploading"), intelligent software would be produced by scanning and closely modeling the computational structure of a biological brain
  • Achieving whole brain emulation requires the accomplishment of the following steps
    • A sufficiently detailed scan of a particular human brain is created
    • The raw data from the scanners is fed to a computer for automated image processing to reconstruct the three-dimensional neuronal network that implemented cognition in the original brain
    • The neurocomputational structure resulting from the previous step is implemented on a sufficiently powerful computer
  • The whole brain emulation path does not require that we figure out how human cognition works or how to program an artificial intelligence
  • It requires only that we understand the low-level functional characteristics of the basic computational elements of the brain
  • Whole brain emulation does require some rather advanced enabling technologies. There are three key prerequisites:
    • Scanning: high-throughput microscopy with sufficient resolution and detection of relevant properties
    • Translation: automated image analysis to turn raw scanning data into an interpreted three-dimensional model of relevant neurocomputational elements
    • Simulation: hardware powerful enough to implement the resultant computational structure
  • The aim is not to create a brain simulation so detailed and accurate that one could use it to predict exactly what would have happened in the original brain if it had been subjected to a particular sequence of stimuli. Instead, the aim is to capture enough of the computationally functional properties of the brain to enable the resultant emulation to perform intellectual work
  • Knowing merely which neurons are connected with which is not enough. To create a brain emulation one would also need to know which synapses are excitatory and which are inhibitory; the strength of the connections; and various dynamical properties of the axons, synapses, and dendritic trees

  • A third path to greater-than-current-human intelligence is to enhance the function of biological brains
  • In principle, this could be achieved without technology, through selective breeding
  • It seems implausible, on both neurological and evolutionary grounds, that one could by introducing some chemical into the brain of a healthy person spark a dramatic rise in intelligence
  • Manipulation of genetics will provide a more powerful set of tools than psychopharmacology
  • The problem with sequential selection, of course, is that it takes longer. If each generational step takes twenty or thirty years, then even just five successful generations would push us well into the twenty-second century
  • One intervention that becomes possible when human genomes can be synthesized is genetic "spell-checking" of an embryo
  • Each of us currently carries a mutational load, with perhaps hundreds of mutations that reduce the efficiency of various cellular processes
  • With gene synthesis we could take the genome of an embryo and construct a version of that genome free from the genetic noise of accumulated mutations
  • Three conclusions
    • at least weak forms of superintelligence are achievable by means of biotechnological enhancements
    • the feasibility of cognitively enhanced humans adds to the plausibility that advanced forms of machine intelligence are feasible
    • when we consider scenarios stretching significantly into the second half of this century and beyond, we must take into account the probable emergence of a generation of genetically enhanced populations, with the magnitude of enhancement escalating rapidly over subsequent decades

  • Although the possibility of direct connections between human brains and computers has been demonstrated, it seems unlikely that such interfaces will be widely used as enhancements any time soon
  • To begin with, there are significant risks of medical complications - infections, electrode displacement, hemorrhage, and cognitive decline - when implanting electrodes in the brain
  • The second reason to doubt superintelligence will be achieved through cyborgization is that enhancement is likely to be far more difficult than therapy
  • Even if there were an easy way of pumping more information into our brains, the extra data inflow would do little to increase the rate at which we think and learn unless all the neural machinery necessary for making sense of the data were similarly upgraded
  • Keeping our machines outside of our bodies also makes upgrading easier
  • The rate-limiting step in human intelligence is not how fast raw data can be fed into the brain but rather how quickly the brain can extract meaning and make sense of the data
  • Perhaps it will be suggested that we transmit meanings directly, rather than package them into sensory data that must be decoded by the recipient. There are two problems with this
    • Brains, by contrast to the kinds of program we typically run on our computers, do not use standardized data storage and representation formats. Rather, each brain develops its own idiosyncratic representations of higher-level content
    • Creating the required interface (to read/write from billions of individually addressable neurons) is probably an AI-complete problem
  • One hope for the cyborg route is that the brain, if permanently implanted with a device connecting it to some external resource, would over time learn an effective mapping between its own internal cognitive states and the inputs it receives from, or the outputs accepted by, the device
    • Then the implant itself would not need to be intelligent; rather, the brain would intelligently adapt to the interface, much as the brain of an infant gradually learns to interpret the signals arriving from receptors in its eyes and ears

  • The idea here is not that this would enhance the intellectual capacity of individuals enough to make them superintelligent, but rather that some system composed of individuals thus networked and organized might attain a form of superintelligence
  • In general terms, a system's collective intelligence is limited by the abilities of its member minds, the overheads in communicating relevant information between them, and the various distortions and inefficiencies that pervade human organizations
  • If communication overheads are reduced (including not only equipment costs but also response latencies, time and attention burdens, and other factors), then larger and more densely connected organizations become feasible
  • Could the Internet become something more than just the backbone of a loosely integrated collective superintelligence - something more like a virtual skull housing an emerging unified super-intellect?

  • We use the term "superintelligence" to refer to intellects that greatly outperform the best current human minds across many very general cognitive domains
  • We will differentiate between three forms:
    • speed superintelligence
    • collective superintelligence
    • quality superintelligence

  • A speed superintelligence is an intellect that is just like a human mind but faster
  • Speed superintelligence: A system that can do all that a human intellect can do, but much faster
  • An emulation operating at a speed of ten thousand times that of a biological brain would be able to read a book in a few seconds and write a PhD thesis in an afternoon. With a speedup factor of a million, an emulation could accomplish an entire millennium of intellectual work in one working day (see the back-of-the-envelope check after this list)
  • Suppose your mind ran at 10000x
    • Because of this apparent time dilation of the material world, a speed superintelligence would prefer to work with digital objects
    • Alternatively, it could interact with the physical environment by means of nanoscale manipulators, since limbs at such small scales could operate faster than macroscopic appendages
  • The speed of light becomes an increasingly important constraint as minds get faster, since faster minds face greater opportunity costs in the use of their time for traveling or communicating over long distances
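  • A quick back-of-the-envelope check of the speedup claims above (my own arithmetic, assuming an 8-hour working day and roughly 10 hours of subjective reading time per book):

    $$ 8\ \text{h} \times 10^{6} = 8 \times 10^{6}\ \text{h} \approx \frac{8 \times 10^{6}}{24 \times 365}\ \text{yr} \approx 913\ \text{yr} \approx \text{one millennium of subjective work per working day} $$

    $$ \frac{10\ \text{h}}{10^{4}} = 3.6\ \text{s} \approx \text{the wall-clock time to read a book at a 10,000-fold speedup} $$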

  • Collective superintelligence: A system composed of a large number of smaller intellects such that the system's overall performance across many very general domains vastly outstrips that of any current cognitive system
  • Collective intelligence excels at solving problems that can be readily broken into parts such that solutions to sub-problems can be pursued in parallel and verified independently
  • A system's collective intelligence could be enhanced by expanding the number or the quality of its constituent intellects, or by improving the quality of their organization
  • The definition does not even imply that the more collectively intelligent society is wiser
  • We can think of wisdom as the ability to get the important things approximately right
  • We should recognize that there can exist instrumentally powerful information processing systems - intelligent systems - that are neither inherently good nor reliably wise
  • A collective superintelligence could, after gaining sufficiently in integration, become a "quality superintelligence"

  • Quality superintelligence: A system that is at least as fast as a human mind and vastly qualitatively smarter
  • We can expand the range of our reference points by considering nonhuman animals, which have intelligence of lower quality
  • Nonhuman animals lack complex structured language; they are capable of no or only rudimentary tool use and tool construction; they are severely restricted in their ability to make long-term plans; and they have very limited abstract reasoning ability
  • The concept of quality superintelligence: it is intelligence of quality at least as superior to that of human intelligence as the quality of human intelligence is superior to that of elephants', dolphins', or chimpanzees'
  • A second way to illustrate the concept of quality superintelligence is by noting the domain-specific cognitive deficits that can afflict individual humans, particularly deficits that are not caused by general dementia or other conditions associated with wholesale destruction of the brain's neurocomputational resources
  • Such examples show that normal human adults have a range of remarkable cognitive talents that are not simply a function of possessing a sufficient amount of general neural processing power or even a sufficient amount of general intelligence: specialized neural circuitry is also needed

  • We might say that speed superintelligence excels at tasks requiring the rapid execution of a long series of steps that must be performed sequentially while collective superintelligence excels at tasks admitting of analytic decomposition into parallelizable sub-tasks and tasks demanding the combination of many different perspectives and skill sets
  • In some domains, quantity is a poor substitute for quality
  • If we widen our purview to include superintelligent minds, we must countenance a likelihood of there being intellectual problems solvable only by superintelligence and intractable to any ever-so-large collective of non-augmented humans
  • We cannot clearly see what all these problems are, but we can characterize them in general terms. They would tend to be problems involving multiple complex interdependencies that do not permit of independently verifiable solution steps: problems that therefore cannot be solved in piecemeal fashion, and that might require qualitatively new kinds of understanding or new representational frameworks that are too deep or too complicated for the current edition of mortals to discover or use effectively

  • Hardware advantages
    • Speed of computational elements
    • Internal communication speed
    • Number of computational elements
    • Storage capacity
    • Reliability, lifespan, sensors, etc.
  • Software advantages
    • Editability
    • Duplicability
    • Goal coordination
    • Memory sharing
    • New modules, modalities, and algorithms

  • If and when such a machine is developed, how long will it be from then until a machine becomes radically superintelligent?
  • We can distinguish three classes of transition scenarios, based on their steepness
    • Slow: A slow takeoff is one that occurs over some long temporal interval, such as decades or centuries
    • Fast: A fast takeoff occurs over some short temporal interval, such as minutes, hours, or days
    • Moderate: A moderate takeoff is one that occurs over some intermediary temporal interval, such as months or years
  • We can conceive the rate of increase in a system's intelligence as a (monotonically increasing) function of two variables: the amount of "optimization power", or quality-weighted design effort, that is being applied to increase the system's intelligence, and the responsiveness of the system to the application of a given amount of such optimization power

    $$ \text{Rate of change in intelligence} = \frac{\text{Optimization power}}{\text{Recalcitrance}}$$

  • A system's recalcitrance might also vary depending on how much the system has already been optimized
    • Often, the easiest improvements are made first, leading to diminishing returns as low-hanging fruits are depleted
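  • A toy numerical sketch of the relation above (my own assumed functional form, not Bostrom's): intelligence grows at a rate equal to the applied optimization power divided by recalcitrance, with recalcitrance rising as the system becomes more optimized and the low-hanging fruit is depleted

```python
# Toy model of: rate of change in intelligence = optimization power / recalcitrance,
# with an assumed recalcitrance that increases as the system gets more optimized
# (the specific numbers and functional form are illustrative only).
def simulate(steps=50, dt=1.0, power=1.0):
    intelligence = 1.0
    for t in range(steps):
        recalcitrance = 0.5 + 0.2 * intelligence       # assumed increasing form
        intelligence += dt * power / recalcitrance     # dI/dt = power / recalcitrance
        if t % 10 == 0:
            print(f"t={t:3d}  intelligence={intelligence:6.2f}  recalcitrance={recalcitrance:5.2f}")
    return intelligence

simulate()
```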

  • The path toward artificial intelligence may feature no obvious milestone or early observation point. It is entirely possible that the quest for artificial intelligence will appear to be lost in dense jungle until an unexpected breakthrough reveals the finishing line in a clearing just a few short steps away
  • It is quite possible that recalcitrance falls when a machine reaches human parity
  • Suppose an AI is composed of two subsystems, one possessing domain-specific problem-solving techniques, the other possessing general-purpose reasoning ability. It could be the case that while the second subsystem remains below a certain capacity threshold, it contributes nothing to the system's overall performance, because the solutions it generates are always inferior to those generated by the domain-specific subsystem. Suppose now that a small amount of optimization power is applied to the general-purpose subsystem and that this produces a brisk rise in the capacity of that subsystem. At first, we observe no increase in the overall system's performance, indicating that recalcitrance is high. Then, once the capacity of the general-purpose subsystem crosses the threshold where its solutions start to beat those of the domain-specific subsystem, the overall system's performance suddenly begins to improve at the same brisk pace as the general-purpose subsystem, even as the amount of optimization power applied stays constant: the system's recalcitrance has plummeted (the toy simulation after this list illustrates this threshold effect)
  • It is also possible that our natural tendency to view intelligence from an anthropocentric perspective will lead us to underestimate improvements in sub-human systems, and thus to overestimate recalcitrance
  • A system's intellectual problem-solving capacity can be enhanced not only by making the system cleverer but also by expanding what the system knows
  • In order to tap the full potential of fast content accumulation, however, a system needs to have a correspondingly large memory capacity. There is little point in reading an entire library if you have forgotten all about the aardvark by the time you get to the abalone
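  • A toy simulation of the two-subsystem thought experiment above (my own illustrative numbers): because the overall system performs only as well as the better of its two subsystems, steady gains in the general-purpose subsystem are invisible until they cross the domain-specific subsystem's level, at which point overall performance suddenly starts improving at the general subsystem's pace

```python
# Toy illustration (illustrative numbers only) of recalcitrance appearing high
# and then plummeting: overall performance is the max of the two subsystems, so
# constant-rate improvement of the general subsystem only shows up once it
# crosses the domain-specific subsystem's capability.
domain_specific = 10.0      # static capability of the specialized subsystem
general = 1.0               # capability of the general-purpose subsystem

for step in range(20):
    general *= 1.3          # brisk, constant-rate improvement
    overall = max(domain_specific, general)
    print(f"step={step:2d}  general={general:7.2f}  overall={overall:7.2f}")
```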

  • Two phases
    • The first phase begins with the onset of the takeoff, when the system reaches the human baseline for individual intelligence. Most of the optimization power applied to the system still comes from outside the system, either from the work of programmers or engineers
    • A second phase will begin if at some point the system has acquired so much capability that most of the optimization power exerted on it comes from the system itself

  • If the takeoff is fast then it is unlikely that two independent projects would be taking off concurrently: almost certainly, the first project would have completed its takeoff before any other project would have started its own
  • If the takeoff is slow then there could plausibly be multiple projects undergoing takeoffs concurrently, so that although the projects would by the end of the transition have gained enormously in capability, there would be no time at which any project was far enough ahead of the others to give it an overwhelming lead
  • If a project did obtain a decisive strategic advantage, would it use it to suppress competitors and form a singleton (a world order in which there is at the global level a single decision-making agency)?

  • One factor influencing the width of the gap between frontrunner and followers is the rate of diffusion of whatever it is that gives the leader a competitive advantage
    • A frontrunner might find it difficult to gain and maintain a large lead if followers can easily copy the frontrunner's ideas and innovations
  • The mere demonstration of the feasibility of an invention can also encourage others to develop it independently

  • The likelihood of the final breakthrough being made by a small project increases if most previous progress in the field has been published in the open literature or made available as open source software
  • Projects designed from the outset to be secret could be more difficult to detect. An ordinary software development project could serve as a front
  • A country that believed it could achieve a breakthrough unilaterally might be tempted to do it alone rather than subordinate its efforts to a joint project. A country might refrain from joining an international collaboration from fear that other participants might siphon off collaboratively generated insights and use them to accelerate a covert national project

  • Many factors might dissuade a human organization with a decisive strategic advantage from creating a singleton. These include non-aggregative or bounded utility functions, non-maximizing decision rules, confusion and uncertainty, coordination problems, and various costs associated with a takeover
  • Human individuals and human organizations typically have preferences over resources that are not well represented by an "unbounded aggregative utility function." A human will typically not wager all her capital for a fifty-fifty chance of doubling it
  • Humans and human-run organizations may also operate with decision processes that do not seek to maximize expected utility

  • The most essential characteristic of a seed AI, aside from being easy to improve (having low recalcitrance), is being good at exerting optimization power to amplify a system's intelligence

  • How could a superintelligence achieve the goal of world domination?
    • Pre-criticality phase
    • Recursive self-improvement phase
    • Covert preparation phase
    • Overt implementation phase

  • An agent's ability to shape humanity's future depends not only on the absolute magnitude of the agent's own faculties and resources, but also on the relative magnitude of its capabilities compared with those of other agents with conflicting goals
  • In a situation where there are no competing agents, the absolute capability level of a superintelligence, so long as it exceeds a certain minimal threshold, does not matter much, because a system starting out with some sufficient set of capabilities could plot a course of development that will let it acquire any capabilities it initially lacks
  • The wise-singleton sustainability threshold: A capability set exceeds the wise-singleton threshold if and only if a patient and existential risk-savvy system with that capability set would, if it faced no intelligent opposition or competition, be able to colonize and re-engineer a large part of the accessible universe

  • Intelligence and final goals are independent variables: any level of intelligence could be combined with any final goal

  • Intelligence and motivation are in a sense orthogonal: we can think of them as two axes spanning a graph in which each point represents a logically possible artificial agent
  • The orthogonality thesis: Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal
  • There are at least three directions from which we can approach the problem of predicting superintelligent motivation:
    • Predictability through design
    • Predictability through inheritance
    • Predictability through convergent instrumental reasons

  • There are some instrumental goals likely to be pursued by almost any intelligent agent, because there are some objectives that are useful intermediaries to the achievement of almost any final goal
  • The instrumental convergence thesis: Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents

  • If an agent retains its present goals into the future, then its present goals will be more likely to be achieved by its future self
  • There are situations in which an agent can best fulfill its final goals by intentionally changing them. Such situations can arise when any of the following factors is significant:
    • Social signaling
    • Social preferences
    • Preferences concerning own goal content
    • Storage costs

  • Improvements in rationality and intelligence will tend to improve an agent's decision-making, rendering the agent more likely to achieve its final goals
  • One would therefore expect cognitive enhancement to emerge as an instrumental goal for a wide variety of intelligent agents
  • For similar reasons, agents will tend to instrumentally value many kinds of information
  • An agent that has access to reliable expert advice may have little need for its own intelligence and knowledge
  • If intelligence and knowledge come at a cost, such as time and effort expended in acquisition, or increased storage or processing requirements, then the agent might prefer less knowledge and less intelligence

  • A software agent might place an instrumental value on more efficient algorithms that enable its mental functions to run faster on given hardware

  • Human beings tend to seek to acquire resources sufficient to meet their basic needs
  • But people usually seek to acquire resources far beyond this minimum level
  • A great deal of resource accumulation is motivated by social concerns - gaining status, mates, friends, and influence, through wealth accumulation and conspicuous consumption

  • Absent a special effort, the first superintelligence may have a random or reductionist final goal
  • The treacherous turn: While weak, an AI behaves cooperatively (increasingly so, as it gets smarter). When the AI gets sufficiently strong - without warning or provocation - it strikes, forms a singleton, and begins directly to optimize the world according to the criteria implied by its final values
  • An AI might calculate that if it is terminated, the programmers who built it will develop a new and somewhat different architecture, but one that will be given a similar utility function. In this case, the original AI may be indifferent to its own demise, knowing that its goals will continue to be pursued in the future

  • There are ways of failing that we might term "malignant" in that they involve an existential catastrophe
  • One feature of a malignant failure is that it eliminates the opportunity to try again
  • Another feature of malignant failure is that it presupposes a great deal of success: only a project that got a great number of things right could succeed in building a machine intelligence powerful enough to pose a risk of malignant failure
  • A superintelligence discovering some way of satisfying the criteria of its final goal that violates the intentions of the programmers who defined the goal
  • The phenomenon where an agent transforms large parts of the reachable universe into infrastructure in the service of some goal, with the side effect of preventing the realization of humanity's axiological potential
  • Unless the AI's motivation system is of a special kind, or there are additional elements in its final goal that penalize strategies that have excessively wide-ranging impacts on the world, there is no reason for the AI to cease activity upon achieving its goal. On the contrary: if the AI is a sensible Bayesian agent, it would never assign exactly zero probability to the hypothesis that it has not yet achieved its goal
  • The claim here is that there is no possible way to avoid this failure mode
  • Might we avoid this malignant outcome if instead of a maximizing agent we build a satisficing agent, one that simply seeks to achieve an outcome that is "good enough" according to some criterion, rather than an outcome that is as good as possible?
  • In mind crime, the side effect is not external to the AI; rather, it concerns what happens within the AI itself (or within the computational processes it generates)
  • A machine superintelligence could create internal processes that have moral status
    • One can imagine scenarios in which an AI creates trillions of such conscious simulations, perhaps in order to improve its understanding of human psychology and sociology. These simulations might be placed in simulated environments and subjected to various stimuli, and their reactions studied. Once their informational usefulness has been exhausted, they might be destroyed

  • Could we engineer the initial conditions of an intelligence explosion so as to achieve a specific desired outcome, or at least to ensure that the result lies somewhere in the class of broadly acceptable outcomes?
    • How can the sponsor of a project that aims to develop superintelligence ensure that the project, if successful, produces a superintelligence that would realize the sponsor's goal?
  • The first principal-agent problem: Whenever some human entity ("the principal") appoints another ("the agent") to act in the former's interest (Human vs Human, Sponsor -> Developer)
  • The second principal-agent problem (the control problem): In this case, the agent is not a human agent operating on behalf of a human principal. Instead, the agent is the superintelligent system (Human vs Superintelligence, Project -> System)

  • Capability control methods seek to prevent undesirable outcomes by limiting what the superintelligence can do
  • Physical containment aims to confine the system to a "box," i.e. to prevent the system from interacting with the external world otherwise than via specific restricted output channels
  • For extra security, the system should be placed in a metal mesh to prevent it from transmitting radio signals, which might otherwise offer a means of manipulating electronic objects such as radio receivers in the environment
  • Physical containment has several advantages
    • It is easy to implement
    • It can be applied to many machine intelligence architectures, even ones that were not initially designed with safety as an objective
    • It can be used in combination with most other control methods
    • It seems unlikely to go wrong by backfiring: while it might fail to ensure safety, it is unlikely to cause a catastrophe that would not otherwise have occurred
  • The main disadvantage with physical confinement is that it reduces the functionality of the superintelligence
  • Another concern is that it might encourage a false sense of security, though this is avoidable if we regard physical confinement as icing on the cake rather than the main substance of our precautions
  • Informational containment aims to restrict what information is allowed to exit the box
  • An obvious informational containment method is to bar the system from accessing communications networks
  • The limiting case of the boxing approach would be a system kept in complete physical and informational isolation
  • Even if achievable, however, such a system would be rather useless since it would have no effect on the external world
  • It might perhaps be thought that some scientific purpose could be served by creating a superintelligence and keeping it in isolation: by studying a self-contained model system, one could learn about its internal dynamics and its behavior patterns
    • But this would be an error. As soon as the system is observed, it ceases to be informationally isolated
  • An AI anticipating that it might be observed could strategically adopt behaviors designed to influence the hypothesized observers
  • Incentive methods involve placing an agent in an environment where it finds instrumental reasons to act in ways that promote the principal's interests
  • It presupposes a balance of power: legal or economic sanctions cannot restrain an agent that has a decisive strategic advantage
  • By relying on social integration to solve the control problem, the principal risks sacrificing a large portion of his potential influence
  • A better alternative might be to combine the incentive method with the use of motivation selection to give the AI a final goal that makes it easier to control
  • A problem with the incentive scheme is that it presupposes that we can tell whether the outcomes produced by the AI are in our interest
  • Limit the system's intellectual faculties or its access to information
  • Even without any designated knowledge base at all, a sufficiently superior mind might be able to learn much by simply introspecting on the workings of its own psyche - the design choices reflected in its source code, the physical characteristics of its circuitry
  • A tripwire is a mechanism that performs diagnostic tests on the system (possibly without its knowledge) and effects a shutdown if it detects signs of dangerous activity
  • Tripwires differ from incentive methods in that they do not rely on the system being aware of the consequences of engaging in forbidden activities
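  • A minimal sketch of the tripwire idea (the diagnostic checks below are hypothetical, for illustration only): periodically run diagnostic predicates over the system's observable state and trigger a shutdown when any of them flags dangerous activity

```python
# Minimal tripwire sketch (hypothetical checks, illustrative only): diagnostic
# predicates are evaluated over the system's observable state, and any positive
# detection triggers a shutdown, regardless of whether the system is aware of
# the consequences of the forbidden activity.
def resource_spike(state):
    return state["cpu_seconds"] > state["cpu_budget"]

def unexpected_network_use(state):
    return state["outbound_connections"] > 0

TRIPWIRES = [resource_spike, unexpected_network_use]

def monitor(state, shutdown):
    for check in TRIPWIRES:
        if check(state):
            shutdown(reason=check.__name__)
            return True
    return False

# Example with a stubbed shutdown handler:
monitor({"cpu_seconds": 120, "cpu_budget": 100, "outbound_connections": 0},
        shutdown=lambda reason: print(f"shutdown triggered: {reason}"))
```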

  • Motivation selection methods seek to prevent undesirable outcomes by shaping what the superintelligence wants to do
  • Motivation selection can involve explicitly formulating a goal or set of rules to be followed (direct specification) or setting up the system so that it can discover an appropriate set of values for itself by reference to some implicitly or indirectly formulated criterion (indirect normativity)
  • One option in motivation selection is to try to build the system so that it would have modest, non-ambitious goals (domesticity)
  • An alternative to creating a motivation system from scratch is to select an agent that already has an acceptable motivation system and then augment that agent's cognitive powers to make it superintelligent, while ensuring that the motivation system does not get corrupted in the process (augmentation)
  • The direct specification approach comes in two versions, rule-based and consequentialist
  • Involves trying to explicitly define a set of rules or values that will cause even a free-roaming superintelligent AI to act safely and beneficially
  • Difficulties in determining which rules or values we would wish the AI to be guided by and the difficulties in expressing those rules or values in computer-readable code
  • The traditional illustration of the direct rule-based approach is the "three laws of robotics"
    • A robot may not injure a human being or, through inaction, allow a human being to come to harm
    • A robot must obey any order given to it by human beings, except where such orders would conflict with the First Law
    • A robot must protect its own existence as long as such protection does not conflict with the First or Second Law
  • "Everything is vague to a degree you do not realize till you have tried to make it precise" - Bertrand Russell
  • One could try to design an AI such that it would function as a question-answering device
  • The basic idea is that rather than specifying a concrete normative standard directly, we specify a process for deriving a standard. We then build the system so that it is motivated to carry out this process and to adopt whatever standard the process arrives at
  • Indirect normativity is a very important approach to motivation selection. Its promise lies in the fact that it could let us offload to the superintelligence much of the difficult cognitive work required to carry out a direct specification of an appropriate final goal
  • The idea is that rather than attempting to design a motivation system de novo, we start with a system that already has an acceptable motivation system, and enhance its cognitive faculties to make it superintelligent
  • The attractiveness of augmentation may increase in proportion to our despair at the other approaches to the control problem

  • An oracle is a question-answering system
  • It might accept questions in a natural language and present its answers as text
  • Even an untrustworthy oracle could be useful
    • We could ask an oracle questions of a type for which it is difficult to find the answer but easy to verify whether a given answer is correct
    • If it is expensive to verify answers, we can randomly select a subset of the oracle's answers for verification. If they are all correct, we can assign a high probability to most of the other answers also being correct
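  • A small sketch of the spot-checking idea (the statistical framing and toy data are mine): if k uniformly sampled answers all verify as correct, then an oracle whose answers were wrong with frequency p would have escaped detection only with probability about (1 - p)^k, so substantial error rates become implausible as k grows

```python
# Sketch of spot-checking an untrusted oracle (toy data; the statistical framing
# is an approximation assuming independent uniform sampling): verify a random
# subset of answers, and note how unlikely a given error rate is to survive k
# clean checks.
import random

def spot_check(answers, verify, k, seed=0):
    rng = random.Random(seed)
    return all(verify(a) for a in rng.sample(answers, k))

def escape_probability(bad_fraction, k):
    # chance that k independent uniform samples all miss the bad answers
    return (1 - bad_fraction) ** k

answers = [(i, 2 * i) for i in range(1000)]                             # toy (question, answer) pairs
print(spot_check(answers, verify=lambda qa: qa[1] == 2 * qa[0], k=30))  # -> True
print(escape_probability(0.10, 30))  # ~0.04: a 10% error rate rarely survives 30 clean checks
```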

  • A genie is a command-executing system: it receives a high-level command, carries it out, then pauses to await the next command
  • A sovereign is a system that has an open-ended mandate to operate in the world in pursuit of broad and possibly very long-range objectives
  • The ideal genie would be a super-butler rather than an autistic savant
  • One option would be to try to build a genie such that it would automatically present the user with a prediction about salient aspects of the likely outcomes of a proposed command, asking for confirmation before proceeding
  • The difference between oracles, genies and sovereigns comes down to alternative approaches to the control problem
  • Safety order: sovereigns < genies < oracles
  • There are risks involved in designing a superintelligence to have a final goal that does not fully match the outcome that we ultimately seek to attain

  • Why build a superintelligence that has a will of its own?
  • Instead of creating an AI that has beliefs and desires and that acts like an artificial person, we should aim to build regular software that simply does what it is programmed to do
  • It is not straightforward to do if the product being created is a powerful general intelligence
  • It might be thought that by expanding the range of tasks done by ordinary software, one could eliminate the need for artificial general intelligence. But the range and diversity of tasks that a general intelligence could profitably perform in a modern economy is enormous
  • Especially relevant for our purposes is the task of software development itself
    • There would be enormous practical advantages to being able to automate this
  • The classical way of writing software requires the programmer to understand the task to be performed in sufficient detail to formulate an explicit solution process consisting of a sequence of mathematically well-defined steps expressible in code
    • This approach works for solving well-understood tasks, and accounts for most of the software currently in use
    • It falls short, however, when nobody knows precisely how to solve all of the tasks that need to be accomplished
  • There are (at least) two places where trouble could then arise
    • First, the superintelligence search process might find a solution that is not just unexpected but radically unintended
    • Second, in the course of the software's operation. If the methods that the software uses to search for a solution are sufficiently sophisticated, they may include provisions for managing the search process itself in an intelligent manner

  • In singleton scenarios, what happens post-transition depends almost entirely on the values of the singleton

  • Market wages fall
  • The only place where humans would remain competitive may be where customers have a basic preference for work done by humans
  • The income share received by labor would dwindle to practically nil
  • The factor share of capital would become nearly 100% of total world product
  • It follows that the total income from capital would increase enormously

  • Most of this section seems to have a very anthropomorphic viewpoint
  • Conveying the impression to other members of the social group of being in flourishing condition - in good health, in good standing with one's peers, and in confident expectation of continued good fortune - may have boosted an individual's popularity
  • A bias toward cheerfulness could thus have been selected for, with the result that human neurochemistry is now biased toward positive affect compared to what would have been maximally efficient according to simpler materialistic criteria
  • Perhaps a more advanced motivation system would be based on an explicit representation of a utility function or some other architecture that has no exact functional analogs to pleasure and pain
  • It is conceivable that optimal efficiency would be attained by grouping capabilities in aggregates that roughly match the cognitive architecture of a human mind
  • Many of the other examples of humanistic traits may have evolved as hard-to-fake signals of qualities that are difficult to observe directly, such as bodily or mental resilience, social status, quality of allies, ability and willingness to prevail in a fight, or possession of resources

  • Carl Shulman has argued that in a population of emulations, selection pressures would favor the emergence of "superorganisms," groups of emulations ready to sacrifice themselves for the good of their clan
  • The essential property of a superorganism is not that it consists of copies of a single progenitor but that all the individual agents within it are fully committed to a common goal
    • The ability to create a superorganism can thus be viewed as requiring a partial solution to the control problem
  • One area in which superorganisms (or other digital agents with partially selected motivations) might excel is coercion

  • A motivation system cannot be specified as a comprehensive lookup table. It must instead be expressed more abstractly, as a formula or rule that allows the agent to decide what to do in any given situation
  • One formal way of specifying such a decision rule is via a utility function
  • A utility function assigns value to each outcome one might obtain, or more generally to each "possible world."
  • Given a utility function, one can define an agent that maximizes expected utility. Such an agent selects at each time the action that has the highest expected utility
  • Identifying and codifying our own final goals is difficult because human goal representations are complex
  • How could a programmer transfer this complexity into a utility function?
    • One approach would be to try to directly code a complete representation of whatever goal we have that we want the AI to pursue; in other words, to write an explicit utility function. This approach might work if we had extraordinarily simple goals
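  • A minimal sketch of the decision rule just described (the outcomes, probabilities, and utilities are my own toy numbers): a utility function assigns value to outcomes, and the agent selects the action with the highest expected utility

```python
# Minimal expected-utility maximizer (toy numbers, illustrative only): each
# action induces a probability distribution over outcomes, a utility function
# scores the outcomes, and the agent picks the action with the highest
# expected utility.
UTILITY = {"good": 10.0, "neutral": 0.0, "bad": -50.0}

ACTIONS = {
    "cautious": {"good": 0.5, "neutral": 0.5, "bad": 0.0},
    "reckless": {"good": 0.8, "neutral": 0.0, "bad": 0.2},
}

def expected_utility(outcome_dist):
    return sum(p * UTILITY[outcome] for outcome, p in outcome_dist.items())

def choose(actions):
    return max(actions, key=lambda a: expected_utility(actions[a]))

print({a: expected_utility(d) for a, d in ACTIONS.items()})  # {'cautious': 5.0, 'reckless': -2.0}
print(choose(ACTIONS))                                       # -> 'cautious'
```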

  • Evolution has produced an organism with human values at least once
  • Evolution can be viewed as a particular class of search algorithms that involve the alternation of two steps, one expanding a population of solution candidates by generating new candidates according to some relatively simple stochastic rule (such as random mutation or sexual recombination), the other contracting the population by pruning candidates that score poorly when tested by an evaluation function
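  • A minimal sketch of the two-step search just described (the target string serves as my own toy evaluation function): expand the population with randomly mutated copies, then contract it by pruning the candidates that score poorly

```python
# Minimal evolutionary search (toy objective: match a target string). Each
# generation alternates the two steps described above: expansion by random
# mutation, contraction by keeping only the best-scoring candidates.
import random

TARGET = "superintelligence"
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def score(candidate):                      # evaluation function: letters in place
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(candidate, rng):                # simple stochastic variation
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]

def evolve(pop_size=50, generations=300, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(pop_size)]
    for _ in range(generations):
        expanded = population + [mutate(c, rng) for c in population]        # expand
        population = sorted(expanded, key=score, reverse=True)[:pop_size]   # contract
        if population[0] == TARGET:
            break
    return population[0]

print(evolve())   # typically converges to "superintelligence"
```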

  • Reinforcement learning is an area of machine learning that studies techniques whereby agents can learn to maximize some notion of cumulative reward
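  • A minimal sketch of this setup (a toy two-armed bandit; the reward probabilities, learning rate, and exploration rate are my own illustrative choices): the agent learns action-value estimates from reward feedback so as to maximize cumulative reward

```python
# Minimal reinforcement learning sketch: epsilon-greedy action selection with
# incremental action-value updates on a toy two-armed bandit (all parameters
# are illustrative only).
import random

def run_bandit(steps=2000, epsilon=0.1, alpha=0.1, seed=0):
    rng = random.Random(seed)
    reward_prob = {"a": 0.3, "b": 0.7}     # environment; unknown to the agent
    q = {"a": 0.0, "b": 0.0}               # learned action-value estimates
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:                     # explore
            action = rng.choice(list(q))
        else:                                          # exploit current estimates
            action = max(q, key=q.get)
        reward = 1 if rng.random() < reward_prob[action] else 0
        q[action] += alpha * (reward - q[action])      # incremental value update
        total_reward += reward
    return q, total_reward

print(run_bandit())   # q should end up near {'a': ~0.3, 'b': ~0.7}
```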

  • How do we ourselves manage to acquire our values?
  • We begin life with some relatively simple starting preferences (e.g., an aversion to noxious stimuli) together with a set of dispositions to acquire additional preferences in response to various possible experiences
  • Both the simple starting preferences and the dispositions are innate, having been shaped by natural and sexual selection over evolutionary timescales
  • DNA contains instructions for building a brain, which, when placed in a typical human environment, will over the course of several years develop a world model that includes concepts of persons and of well-being. Once formed, these concepts can be used to represent certain meaningful values
  • Instead of specifying complex values directly, could we specify some mechanism that leads to the acquisition of those values when the AI interacts with a suitable environment?

  • Involves giving the seed AI an interim goal system, with relatively simple final goals that we can represent by means of explicit coding or some other feasible method. Once the AI has developed more sophisticated representational faculties, we replace this interim scaffold goal system with one that has different final goals
  • One could also try to use motivation selection methods to induce a more collaborative relationship between the seed AI and the programmer team
  • One could even imagine endowing the seed AI with the sole final goal of replacing itself with a different final goal, one which may have been only implicitly or indirectly specified by the programmers
  • The motivational scaffold approach is not without downsides. One is that it carries the risk that the AI could become too powerful while it is still running on its interim goal system
  • Another downside is that installing the ultimately intended goals in a human-level AI is not necessarily that much easier than doing so in a more primitive AI

  • Involves using the AI's intelligence to learn the values we want it to pursue
  • We must provide a criterion for the AI that at least implicitly picks out some suitable set of values
  • Learning does not change the goal. It changes only the AI's beliefs about the goal
  • One outstanding issue is how to endow the AI with a goal such as "Maximize the realization of the values described in the envelope"
    • To do this, it is necessary to identify the place where the values are described
  • What we might call the "Hail Mary" approach is based on the hope that elsewhere in the universe there exist (or will come to exist) civilizations that successfully manage the intelligence explosion, and that they end up with values that significantly overlap with our own
  • Another idea for how to solve the value-loading problem has recently been proposed by Paul Christiano
    • Suppose we could obtain (a) a mathematically precise specification of a particular human brain and (b) a mathematically well-specified virtual environment that contains an idealized computer with an arbitrarily large amount of memory and CPU power
    • Given (a) and (b), we could define a utility function U as the output the human brain would produce after interacting with this environment
    • U could serve as the value criterion for a value learning AI, which could use various heuristics for assigning probabilities to hypotheses about what U implies
    • Christiano observes that in order to obtain a mathematically well-specified value criterion, we do not need a practically useful computational model of a mind, a model we could run. We just need a (possibly implicit and hopelessly complicated) mathematical definition - and this may be much easier to attain

  • One could argue that whole brain emulation research is less likely to involve moral violations than artificial intelligence research, on the grounds that we are more likely to recognize when an emulation mind qualifies for moral status than we are to recognize when a completely alien or synthetic mind does so

  • Some intelligent systems consist of intelligent parts that are themselves capable of agency
  • The motivations of such composite systems depend not only on the motivations of their constituent subagents but also on how those subagents are organized
  • Human-level subagents have the ability to strategize and might thus choose to conceal certain goals while their behavior was being monitored

  • Suppose that we had solved the control problem so that we were able to load any value we chose into the motivation system of a superintelligence, making it pursue that value as its final goal. Which value should we install?
  • To select a final value based on our current convictions, in a way that locks it in forever and precludes any possibility of further ethical progress, would be to risk an existential moral calamity
  • Instead of making a guess based on our own current understanding, we would delegate some of the cognitive work required for value selection to the superintelligence

  • Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted
  • Encapsulate moral growth
  • Avoid hijacking the destiny of humankind
  • Avoid creating a motive for modern-day humans to fight over the initial dynamics
  • Keep humankind ultimately in charge of its own destiny

  • Instead of implementing humanity's coherent extrapolated volition, one could try to build an AI with the goal of doing what is morally right, relying on the AI's superior cognitive capacities to figure out just which actions fit that description
  • Moral rightness (MR) appears to have several advantages over CEV
    • MR would do away with various free parameters in CEV
    • It would seem to eliminate the possibility of a moral failure resulting from the use of an extrapolation base that is too narrow or too wide
    • MR would orient the AI toward morally right action even if our coherent extrapolated volitions happen to wish for the AI to take actions that are morally odious
  • MR would also appear to have some disadvantages
    • It relies on the notion of "morally right," a notoriously difficult concept
    • A more fundamental issue with MR is that even if it can be implemented, it might not give us what we want or what we would choose if we were brighter and better informed
  • If we knew how to code "Do What I Mean" in a general and powerful way, we might as well use that as a standalone goal
  • Goal content
    • What objective should the AI pursue?
    • How should a description of this objective be interpreted?
    • Should the objective include giving special rewards to those who contributed to the project's success?
  • Decision theory
    • Should the AI use causal decision theory, evidential decision theory, updateless decision theory, or something else?
  • Epistemology
    • What should the AI's prior probability function be, and what other explicit or implicit assumptions about the world should it make?
    • What theory of anthropics should it use?
  • Ratification
    • Should the AI's plans be subjected to human review before being put into effect?
    • If so, what is the protocol for that review process?
  • In general, it seems wise to aim at minimizing the risk of catastrophic error rather than at maximizing the chance of every detail being fully optimized
  • Two reasons
    • Humanity's cosmic endowment is astronomically large
    • There is a hope that if we but get the initial conditions for the intelligence explosion approximately right, then the resulting superintelligence may eventually home in on, and precisely hit, our ultimate objectives
  • It is not necessary for us to create a highly optimized design. Rather, our focus should be on creating a highly reliable design, one that can be trusted to retain enough sanity to recognize its own failings. An imperfect superintelligence, whose fundamentals are sound, would gradually repair itself; and having done so, it would exert as much beneficial optimization power on the world as if it had been perfect from the outset

  • Technological completion conjecture: If scientific and technological development efforts do not effectively cease, then all important basic capabilities that could be obtained through some possible technology will be obtained
  • The principle of differential technological development: Retard the development of dangerous and harmful technologies, especially ones that raise the level of existential risk; and accelerate the development of beneficial technologies, especially those that reduce the existential risks posed by nature or by other technologies
  • What matters is that we get superintelligence before other dangerous technologies, such as advanced nanotechnology
  • An increase in either the mean or the upper range of human intellectual ability would likely accelerate technological progress across the board, including progress toward various forms of machine intelligence, progress on the control problem, and progress on a wide swath of other technical and economical objectives
  • Consider the limiting case of a "universal accelerator," an imaginary intervention that accelerates literally everything
    • The action of such a universal accelerator would correspond merely to an arbitrary rescaling of the time metric, producing no qualitative change in observed outcomes
  • Macro-structural development accelerator: A lever that accelerates the rate at which macro-structural features of the human condition develop, while leaving unchanged the rate at which micro-level human affairs unfold
  • Two kinds of existential risk:
    • State risks: associated with being in a certain state; the total amount of state risk to which a system is exposed is a direct function of how long the system remains in that state
    • Step risks: a discrete risk associated with some necessary or desirable transition. Once the transition is completed, the risk vanishes (see the formalization after this list)
  • We can then say the following regarding a hypothetical macro-structural development accelerator:
    • We should favor acceleration - provided we think we have a realistic prospect of making it through to a post-transition era in which any further existential risks are greatly reduced
    • If it were known that there is some step ahead destined to cause an existential catastrophe, then we ought to reduce the rate of macro-structural development in order to give more generations a chance to exist before the curtain is rung down
  • Refers to a condition in which two technologies have a predictable timing relationship, such that developing one of the technologies has a robust tendency to lead to the development of the other, either as a necessary precursor or as an obvious and irresistible application or subsequent step
  • It is no good accelerating the development of a desirable technology Y if the only way of getting Y is by developing an extremely undesirable precursor technology X, or if getting Y would immediately produce an extremely undesirable related technology Z
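  • One way to formalize the state/step distinction above (my own notation, not Bostrom's): with a constant hazard rate lambda, the state risk accumulated over a period of length T grows with the time spent in the state, whereas a step risk is a fixed probability p attached to the transition itself, however quickly it is made

    $$ P_{\text{state}}(T) = 1 - e^{-\lambda T} \qquad\qquad P_{\text{step}} = p $$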

  • Faster computers make it easier to create machine intelligence
  • Hardware can to some extent substitute for software; thus, better hardware reduces the minimum skill required to code a seed AI
  • Fast computers might also encourage the use of approaches that rely more heavily on brute-force techniques and less on techniques that require deep understanding to use
  • Rapid hardware progress increases the likelihood of a fast takeoff
  • It seems difficult to have much leverage on the rate of hardware advancement
  • There are at least three putative advantages of whole brain emulation
    • Its performance characteristics would be better understood than those of AI
    • It would inherit human motives
    • It would result in a slower takeoff
  • On balance, it looks like the risk of the AI transition would be reduced if WBE comes before AI. However, when we combine the residual risk in the AI transition with the risk of an antecedent WBE transition, it becomes very unclear how the total existential risk along the WBE-first path stacks up against the risk along the AI-first path

  • A race dynamic exists when one project fears being overtaken by another
  • The race dynamic could spur projects to move faster toward superintelligence while reducing investment in solving the control problem
  • It reduces the haste in developing machine intelligence
  • It allows for greater investment in safety
  • It avoids violent conflicts
  • It facilitates the sharing of ideas about how to solve the control problem
  • It tends to produce outcomes in which the fruits of a successfully controlled intelligence explosion get distributed more equitably
  • In general, greater post-transition collaboration appears desirable
    • It would reduce the risk of dystopian dynamics in which economic competition and a rapidly expanding population lead to a Malthusian condition, or in which evolutionary selection erodes human values and selects for non-eudaemonic forms, or in which rival powers suffer coordination failures such as wars and technology races
  • The ideal form of collaboration for the present may therefore be one that does not initially require specific formalized agreements and that does not expedite advances in machine intelligence
  • The common good principle: Superintelligence should be developed only for the benefit of all of humanity and in the service of widely shared ethical ideals

  • The question is not whether the result discovered by the Fields Medalist is in itself "important". Rather, the question is whether it was important that the medalist enabled the publication of the result to occur at an earlier date. The value of this temporal transport should be compared to the value that a world-class mathematical mind could have generated by working on something else
  • In some cases, the Fields Medal might indicate a life spent solving the wrong problem - for instance, a problem whose allure consisted in being famously difficult to solve
  • The outlook now suggests that philosophic progress can be maximized via an indirect path rather than immediate philosophizing

  • To limit the risk of doing something actively harmful or morally wrong, we should prefer to work on problems that seem robustly positive-value (i.e., whose solution would make a positive contribution across a wide range of scenarios) and to employ means that are robustly justifiable (i.e., acceptable from a wide range of moral views)
  • We want to work on problems that are elastic to our efforts at solving them
    • Highly elastic problems are those that can be solved much faster, or solved to a much greater extent, given one extra unit of effort
  • To reduce the risks of the machine intelligence revolution, we will propose two objectives that appear to best meet all those desiderata:
    • Strategic analysis
    • Capacity-building
  • What we mean by "strategic analysis" here is a search for crucial considerations: ideas or arguments with the potential to change our views not merely about the fine-structure or implementation but about the general topology of desirability
  • One important variable is the quality of the "social epistemology" of the AI-field and its leading projects. Discovering crucial considerations is valuable, but only if it affects action

  • Bostrom, Nick. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.