If AI alignment is too slow, good box design may help us not die.

If you haven’t read much about artificial general intelligence before, this post is a good introduction, and this post is a good response to it.

A central worry in the design of artificial general intelligence (AGI) is that, for most seemingly useful goals, an AGI will predictably do things we don’t intend. Humans aren’t great at deeply understanding and specifying their goals, and an AGI will be smarter than us at finding ways to “technically” solve our problems.

One type of problem hypothesized here involves uncertainty: if an AGI is not certain it has accomplished its goal, it has no reason to stop. People use the story of the Sorcerer’s Apprentice as an example: once the apprentice uses magic to command a broom to fill a cauldron with water, the broom keeps filling the cauldron even after it is full. The apprentice did not specify his goal precisely enough, and once the process is in motion it is hard to stop.

Enough water? Status: Uncertain

One poor way of dealing with this, brought up in the book Superintelligence, is to specify limits: if you tell an AGI to make a specific number of goods, it may stop when you intend it to. The problem, as the book goes on to explain, is infrastructure profusion: if the AGI is anything but 100% certain its goal has been achieved (which it can’t be), it makes sense for it to keep devoting resources to ensuring and verifying completion of the goal. If smart enough, the AGI could dedicate itself to acquiring resources such as computation and production capacity, while building skills like manipulation and possibly even military capacity, to ensure nothing can interfere with the completion of its goal.

A further idea for preventing this type of failure is to make the AGI lazy: give it a probability threshold at which it should be okay with stopping, perhaps 95% certainty for some tasks. Bostrom’s objection is that infrastructure profusion could still happen, because it may be the first idea to occur to an AGI. While this may be true, the point I’d like to make is that there is no longer nearly as large an incentive for infrastructure profusion with this solution, and removing that incentive is useful! Additional constraints, like time limits for task completion and efficiency requirements, could further narrow the range of options for an AGI and may reduce instrumental convergence toward strategies we don’t want it to pursue.[1]
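As a toy illustration of the "lazy" stopping rule described above, here is a minimal Python sketch. Everything in it is hypothetical and invented for this post: `run_bounded_task`, `StopReport`, the `Cauldron` class, and the particular limits. It only shows the shape of the rule (stop at a certainty threshold, or when a time or step budget runs out); it says nothing about how one would actually get an AGI to hold such a goal.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StopReport:
    reason: str        # "confident_enough", "out_of_time", or "out_of_steps"
    steps: int         # number of work steps taken before stopping
    confidence: float  # last estimated probability that the goal is met


def run_bounded_task(
    step: Callable[[], None],
    estimate_confidence: Callable[[], float],
    confidence_target: float = 0.95,   # assumed threshold, as in the 95% example
    time_limit_s: float = 1.0,         # assumed wall-clock budget
    max_steps: int = 1000,             # assumed effort budget
    clock: Callable[[], float] = time.monotonic,
) -> StopReport:
    """Work until confident enough, or until a time or step budget is spent."""
    deadline = clock() + time_limit_s
    steps = 0
    confidence = estimate_confidence()
    while True:
        # Check every stopping condition before doing any more work.
        if confidence >= confidence_target:
            return StopReport("confident_enough", steps, confidence)
        if clock() >= deadline:
            return StopReport("out_of_time", steps, confidence)
        if steps >= max_steps:
            return StopReport("out_of_steps", steps, confidence)
        step()
        steps += 1
        confidence = estimate_confidence()


class Cauldron:
    """Toy task: the agent can never be fully certain the cauldron is full."""

    def __init__(self, capacity: int = 10) -> None:
        self.capacity = capacity
        self.buckets = 0

    def add_bucket(self) -> None:
        self.buckets += 1

    def confidence_full(self) -> float:
        # Confidence rises with the fill level but is capped below 1.0.
        return min(self.buckets / self.capacity, 0.99)


cauldron = Cauldron()
report = run_bounded_task(cauldron.add_bucket, cauldron.confidence_full)
print(report)
```

With the default 95% target, the toy broom stops after ten buckets. If you set `confidence_target=1.0`, the certainty cap means the target is never reached, and only the time or step budget ends the run, which is the Sorcerer’s Apprentice situation in miniature.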

Such an AGI is limited, and it will not necessarily be aligned with humans in many environments, but it will be constrained from harming us in many ways and can still be pretty useful.[2]

There will be strong incentives to build AGI with longer-term planning, less laziness in finding solutions to problems, and so on, but the restrained AGI does give people more control. It makes sense to reason from first principles and physics when designing restraints that prevent an AGI from gaining a decisive strategic advantage over humans. An AGI with no goals outside its box won’t try to escape if escaping can’t possibly help its “inside-the-box” goals, given time and resource constraints that are themselves part of the goal the AGI is trying to satisfy. If aligning AGI with human values turns out to be impossible,[3] we will need ways to get use from AGI without it ruining the world.

There are other problems to solve. Maybe such an AGI will find a way to short-circuit its reward signals and make itself believe time hasn’t run out, or adopt other false beliefs convenient for completing its goals. But these sorts of problems can be solved in the same engineering-minded way that humans solve other problems. Yes, AGIs will still search for the nearest unblocked strategy, and there is still the risk of a treacherous turn, but it is a lower risk! Just because these partial safety solutions don’t get us to alignment does not mean they won’t be useful along the way, since generally intelligent AGI may be here, and in use by companies and governments, long before we figure out alignment. Even if you have an algorithm that will eventually converge on being aligned with humans, it would be nice for it not to do lots of harm in the meantime.[4]

So then, in terms of research time allocation, what is the best thing to do: engineer safety via restraints, or try to solve alignment? I am unsure, but given that different people have different goals, I expect it will be easier to get people to adopt specific constraints that clearly protect their own interests than to get them to do nothing and forgo a lot of money and opportunity until alignment is solved. Basically, if we are pressed for time, we have to do box design.

Fortunately, it does look like MIRI has spent a significant amount of time investigating ways of controlling AGI that don’t just amount to giving it exactly the same values as humans. In this paper, they explore many ways of reducing AGI autonomy to increase human control, and ways of making its values less likely to lead to bad instrumental incentives. The paper focuses more on internal restraint via the AGI’s program than on external restraints, which makes sense, since it is unlikely humans will outsmart a superintelligence and because changing software is easier than changing hardware. Likewise, a better way to describe the idea I am trying to convey might be along the lines of programming AGI to “love its box.”[5] Most restraint should be internal to the program, but this internal restraint allows us to successfully impose external restraint in areas where the AGI might still do something bad if it had more power before we were able to train it properly.

For thousands of years, humans have been building institutions that behave more intelligently than individual humans, but such institutions haven’t necessarily had the greater interests of humanity in mind. While with AGI you can start from a blank slate, and strategic advantage could be lost much faster, we shouldn’t ignore the lessons from other areas where people try to align the interests of groups and prevent powerful new groups from gaining a decisive strategic advantage over others.


  1. In this way, for lots of specific tasks, you might be able to prevent even a superintelligent AGI from destroying everything while still getting useful work done. By default it (and any sub-agents it makes) will stop in X time, with certainty level Y about goal completion, and humans then have the opportunity to adjust the program again. These windows for adjusting the AGI’s goal, during which the AGI has no preference about its goals, could be useful for learning what strategies it would have considered with more time. Once its original goal timer runs out, the AGI has no incentive rooted in its original goals to lie, so you can program it to give as accurate a response as possible within a new set of constraints that are also built for safety.
  2. Provided you program it to do something useful.
  3. There are many senses in which alignment is impossible. Not everyone can get what they want, and different humans have contradictory values. There will be winners and losers even if AI is very aligned with the interests of most humans. Aligning AI goals and actions with human interests may also require reconciling human interests with one another, and those interests are already hard to specify. Rather than defining human interests directly, some AI researchers think inverse reinforcement learning may be a good way for AI to gradually learn human values by observing human behavior.
  4. The exception is when you want the AI to mess up quickly in minor ways so you can train it not to do very bad things when it is more powerful.
  5. But not too much! 😉
