(Published May 15, 2023 by Mariven)
Here, I'll give you a formalism for studying the logical structure of goals, and, using it to characterize the space of possibilities through which AI ruin may end up obtaining, demonstrate what exactly it is about the alignment problem that prevents us from solving it.
A/N: This was written for a popular audience, so it lacks the branching writing structures I usually use to explain my reasoning; as such, it's more evocative than explicative compared to my other writing. The main ideas presented herein form one part of a much more comprehensive framework for thinking about optimization in general, to be introduced in an as-yet-unfinished essay. That essay, Worldspace, will treat of the logical, physical, and computational structure of what I call here the "possibility space", demonstrating how this structure can be used to understand and predict the convergent behaviors of superintelligences.
In propositional (zeroth-order) logic, we work with propositions, or statements that are unambiguously either true or false. In predicate (first-order) logic, we work with abstract properties that, when applied to some particular object $x$, yield propositions whose truth value turns on whether $x$ has the property. If $P$ represents the property of being mortal—we can call it the mortality predicate—then the proposition $P(x)$ is true when $x$ refers to a living thing that happens to be mortal, and false otherwise.
Unlike a proposition, a predicate does not have a truth value in itself; nevertheless, we can wring one out of it by finding some way to quantify the way in which it holds across all $x$ of the correct type—in this case, $x$ can vary across the set $X$ of all living things. We call this method of turning predicates into propositions quantification. There are two common ways of quantifying the manner in which $P$ holds across $X$: universal quantification, written $(\forall x \in X)\,P(x)$, which asserts that $P$ holds for every $x$ in $X$, and existential quantification, written $(\exists x \in X)\,P(x)$, which asserts that $P$ holds for at least one $x$ in $X$.
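To make this concrete, here's a minimal sketch in Python, using the built-in all() and any() as finite stand-ins for ∀ and ∃; the domain and the mortality predicate below are toy inventions of mine, there only to show how a predicate plus a quantifier yields a single proposition.

```python
# Quantification over a finite domain: all() plays the role of ∀, any() of ∃.
# The domain X and the predicate are illustrative toys, not anything canonical.

X = ["Socrates", "a housefly", "a bristlecone pine", "a freshwater hydra"]

def is_mortal(x: str) -> bool:
    # The predicate P: maps each object of the domain to True or False.
    # (Hydras are close to biologically non-senescent, so we flag them.)
    return "hydra" not in x

# Universal quantification: the single proposition (∀ x ∈ X) P(x).
forall_mortal = all(is_mortal(x) for x in X)   # False: one counterexample suffices

# Existential quantification: the single proposition (∃ x ∈ X) P(x).
exists_mortal = any(is_mortal(x) for x in X)   # True: one witness suffices

print(forall_mortal, exists_mortal)
```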
Let's talk about when and how this general notion of quantification—of making a generalized predicate into a concrete proposition by talking about what can be said about its space of possible instantiations—can be put to use in our thinking about the real world. To this end, let's characterize the differences between real-world thinking and mathematical thinking, chief among them being that real-world thinking is always carried out in the service of some end.
An example of the relevance of ends: if our goal is to rid a house of a termite infestation, we want to achieve this goal by killing as many termites as we can. This isn't literally necessary—you could kill just enough of them that the colony can no longer reproduce—but doing the bare minimum isn't robust, since just a small missed patch or exogenous source allows them to grow back. So, even though it's not feasible to kill every single termite, the success condition is, for practical purposes, much closer in form to the extermination of all termites than to the extermination of some. Our goal isn't literally "(∀ termite)(dead)", but it behaves like a slightly weaker form of it.
If on the other hand we're on team termite, trying to survive the extermination, it is neither necessary nor feasible for every single termite to survive. We want some to survive, preferably enough for the colony to grow back. Our goal on this side isn't literally "(∃ termite)(not dead)", but it behaves like a slightly stronger form of it.
We can say that the exterminator has a ∀-like goal, while the termites have an ∃-like goal. The two goals are clearly opposites: for the exterminator to win is for the termites to lose, and vice versa.
That ∀-like goals should be opposed to ∃-like goals in general is due to the duality of existential and universal quantification. Logically speaking, the negation of (∀ object)(predicate) is (∃ object)(negation of predicate): for it to be false that a predicate is true for every object in its range is for it to be true that the predicate is false for some object in its range.
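On a finite domain this duality can be checked mechanically; the short Python sketch below (over an arbitrary three-element toy domain) just verifies that negating all() always coincides with any() of the negation.

```python
# Check ¬(∀x)P(x) == (∃x)¬P(x) for every possible predicate on a small domain.
# Each predicate on a 3-element domain is just an assignment of True/False
# to its elements, so we can enumerate all eight of them exhaustively.
import itertools

def duality_holds(domain, predicate) -> bool:
    return (not all(predicate(x) for x in domain)) == any(not predicate(x) for x in domain)

domain = [0, 1, 2]
for bits in itertools.product([False, True], repeat=len(domain)):
    predicate = lambda x, bits=bits: bits[x]
    assert duality_holds(domain, predicate)

print("¬∀ = ∃¬ holds for every predicate over this domain")
```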
The manner in which a goal is quantified inherently depends on the structure of the space of possibilities that the actor with the goal admits. If you're really serious about removing every termite, firebomb the house. Or drop several firebombs—then you only have to hope that one hits, while the termites have to hope that all of them miss. But we generally want the house to remain undamaged and usable shortly after the termites are gone; it is this tying of our hands behind our backs, this restricting of the possibility space, that limits us to the search-and-destroy family of strategies with respect to which the extermination problem has the quantificatory structure above.
Now, the structure of your possibility space is generally invisible to you. By considering weird thought experiments you can shine a light on some of its limitations, but the full structure of the space only exists implicitly, not making itself visible except in those small spots you end up actually thinking about. You can't see what you can't see, and you can't see that you can't see it. Try your hardest to "think outside of the box", and you'll only end up constructing a box and then exploring the plane on which that box resides before settling down with a false confidence, for your picture of the possibility space isn't limited by the breadth of its span so much as by its dimensionality. Truly unexpected possibilities are almost never extreme versions of obvious possibilities; they're entirely new things that shock and confuse you because you had never thought anything like them, let alone thought them impossible. If you don't instinctively understand this, you may end up coming to believe such silly things as that a superintelligence would not be able to kill literally every human, because all it takes is a couple hidden in a bunker somewhere to keep the human race going—in the development of your beliefs nothing will shout at you that "my thoughts do not leave the borders of my map", and you will therefore never think that there might be ways to exterminate humanity that just don't have such failure modes.
To build on this with a more advanced version of the termite example, let's consider a group of humans trying to use human-built technology to cause human extinction. How are we to quantify this goal—are they on team ∀, the team that needs everything to go right, or on team ∃, the team that only needs one thing to go right?
Well, consider the obvious options: gaining control of all nukes, creating and spreading bioweapons, triggering environmental superdisasters, and so on. All of these have the same kind of failure mode: some weird off-the-radar religious commune somewhere in the Yukon where the mountain winds keep away fallout through some density-dependent particulate nucleating condensation blah blah blah. A handful of survivors anywhere, and the goal is missed.
So, with respect to the space of possibilities our technology makes available, team extinction is team ∀, which gives it one hell of an intrinsic disadvantage. The way to shift teams is to change the possibility space, which can be done through, among other things, technological advance.
Team extinction, then, wants to develop technology that by its nature ends up systematically scouring the Earth of humans. I don't mean to spread infohazardous information, but there are actually some really plausible paths to such technologies—you can find some here!
How is this relevant to the problem of AI-¬(∀ person)(dead)-ism? An infracognitive disposition might lead you to believe that the relevant quantification is in the name—since ¬(∀ person)(dead) = (∃ person)(¬dead), we must have an ∃-like goal, namely the preservation of at least some small part of humanity—but, again, it's all in the structure of the possibility space. A future in which the vast majority of humans are wiped out by AI in a single event is almost certainly a future in which the rest die shortly afterwards.
Our team has a ∀-like goal: to prevent any future AI from finding any approach to some goal which approach results in ruin. Team Machine has an ∃-like goal: for some future AI to find some approach to any goal which results in ruin. (Do note that these 'teams' are not groups of people but abstractions of movement patterns through the possibility space; we're only "on" one team or the other insofar as we work towards one abstraction or the other).
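Spelled out in the quantifier notation used above, with "AI", "goal", and "approach" as informal variable names of my own choosing, the two goals look like this:

$$(\forall\,\text{AI})(\forall\,\text{goal})(\forall\,\text{approach})\ \neg\text{ruin} \qquad\text{versus}\qquad (\exists\,\text{AI})(\exists\,\text{goal})(\exists\,\text{approach})\ \text{ruin},$$

the second being the literal negation of the first, per the duality of ∀ and ∃ noted earlier.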
The ∀-likeness and ∃-likeness of goals come in degrees, as do the opportunities for reorienting the possibility space so as to change the quantifier which applies to a goal; making these degrees precise would take a proper formalism for the structure of possibility spaces. Such a formalism is a prerequisite to dissecting the abstract nature of instrumentality, to thinking fruitfully about the behavior of superintelligences, to the systematic construction and control of arbitrary optimizing systems. I've been developing one by co-opting concepts from statistical mechanics and differential geometry, but it'll be in pre-alpha for a while given my lack of intellectual or financial support. So for now, we'll have to approximate these as two ends of a continuum of possible orientations in the possibility space, where an actor with a ∀-like goal has to block off huge portions of the space while one with an ∃-like goal just needs to find a single weak point, a tiny opening to pass through—like a soccer goalie and a kicker, respectively.
Except instead of 1 kicker every couple minutes, it's 1 feeble one at first, then 5 weak ones a minute later, 30 mediocre ones thirty seconds later, 120 competent ones fifteen seconds later, 600 strong ones eight seconds later, 3000 inhuman ones four seconds later, and so on—and any single goal likely means game over.
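To get a feel for just how lopsided this is even before the kickers start improving, here's a back-of-the-envelope sketch in Python; the 99% per-shot block rate is a number I made up for illustration.

```python
# If each shot is blocked independently with probability b, the goalie's
# chance of blocking *all* n shots is b**n, while the kickers only need
# a single shot to get through. The block rate and shot counts are toys.

def p_all_blocked(block_prob: float, n_shots: int) -> float:
    return block_prob ** n_shots

for n in [1, 5, 30, 120, 600, 3000]:
    print(f"{n:5d} shots at a 99% block rate each: "
          f"P(nothing gets through) = {p_all_blocked(0.99, n):.3g}")
```

Even a goalie who blocks 99% of individual shots is all but guaranteed to concede once the shots number in the thousands.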
In laying this out, I'll run along many of the same lines as AGI Ruin Scenarios are Likely and Disjunctive and List of Lethalities. In fact, what I call the ∃-likeness of ruin is what the former calls its disjunctiveness. I'm not saying anything new. I'm not even saying anything non-obvious. But to see this as obvious requires looking at what the goal actually is in the context of how the world actually tends to work.
In any case, these are only logical structures. It takes object-level considerations about AI to form an object-level world model. Here, I'll articulate what seem to me to be the two main reasons that artificial superintelligence will almost certainly lead to ruin.
Just don't forget that the object-level details are controlled by the logical structure of our goal: I can name problems #7, #16, and #525, and someone may even come up with feasible solutions to each of them, but we're on team ∀, not team ∃, meaning that the logical structure of our goal makes relevant countless other little details which we just don't know about. The result of fixing everything we can see is that we'll be killed by the things we can't see.
We don't, and can't, know which goal $\times$ mind $\times$ environment tuples will lead to destruction: to be able to reliably tell would require us to be able to determine the way in which the mind would pursue the goal in the environment, a task which would require the level of intelligence of the mind itself. Minds with goals are things which look at the environment and determine how to alter it, and, for all but the dumbest minds, the manner of altering it which the mind ends up choosing depends very sharply on the affordances provided by the world. So sharply, in fact, that approximation becomes impossible very quickly.
If the AI behaves as expected in your sandbox, it might just be because the sandbox eliminates all the little details that make the real world so weird. Taking it out of the sandbox, it'll find these details—you'd call them "side-channel attacks" and "security vulnerabilities" as though they were cheap tricks rather than simple features of reality—and quickly exploit all the cool new optimization strategies they offer. When the AI sees the same two-dimensional fence that successfully contained it in your two-dimensional sandbox, and simply steps over it via any one of the countless microscopic dimensions that the real world offers, you won't even have time to say "wait, that's not fair".
This greatly constrains the space of strategies we can use, since it means we have to take precautions rather than seeing and reacting to what comes up. We can't just wait and see what an AI will do and politely ask its creators to shut it off, since once we see a potentially threatening artificial superintelligence, it's probably too late. Those entities that try to mitigate state security threats and zero-day exploits—other cases of team ∀ actors—can often respond to newly revealed threats, but if we detect a superintelligence operating, we can only assume that it has already concluded we wouldn't be able to do anything about it.
* (I'm not sure you can ever really guarantee that your successor has no blind spots, cf. Reflections on Trusting Trust, but (a) if any intelligence is gonna develop a sneaky way around this, it'll be a superintelligence, (b) even if it can't be guaranteed, it's a hell of a lot more likely that such blind spots will be eliminated, and (c) building a successor is basically obligatory for other reasons, so you may as well try to eliminate blind spots in the process).
Being up against team ∃ doesn't always rule out perfect precaution: if you can upper-bound the set of threats that might obtain, you just have to make sure each of them is covered. If there were only five computer chips on the planet that could be used to build artificial superintelligences, it would be a lot easier to simply track them down and keep them from being misused (e.g. by destroying them, buying out their creators, erasing all records of their existence). But for team ∀ in the world as it actually is, perfect precaution is just not going to happen.
Of course, in this specific case, the sheer vastness of the possibility space isn't even the greatest problem with perfect precaution. If humanity were coordinated and alignmentpilled, we'd have some chance. It is neither.
As the superintelligence waterline lowers—as compute becomes more powerful; as algorithms become more intelligent, more efficient with what compute they have, and more capable of utilizing the exponentially increasing amounts of it; and as more money and attention get poured into the task of building it (with maybe an $O(\sqrt x)$ amount of that going into alignment research, the vast majority of which will be done by morons and conmen who neither know nor care what the actual problems are but convincingly assure the world they have the solution)—there will be more and more powerful entities using more and more force, and, somehow, an even greater force has to be put in place to stop them. It won't happen. Even if it did, it wouldn't happen in any actually useful way. Consider that if it transpires that there even is a rogue data center that needs to be destroyed, air strikes would be too little, too late. We could optimistically hope that they'd work, whether as threat or reality, but they won't—an even vaguely competent adversary would not localize their data centers, would find ways to cover up their footprint, would acquire protection from any major nation willing to covertly defect.
I'll illustrate this point with the hypothetical boxed AI, whose goal is to manipulate a human to remove the airgap keeping it from the outside world. We assume that all rowhammer-style attacks have magically been rendered impossible, so that it can only escape by manipulating the human operator.
As Yudkowsky notoriously demonstrated, such an AI can escape, even when it's just a human playing pretend. But suppose instead that he had horrifically, embarrassingly failed every single one of his trials. How much evidence would this be for the hypothesis that an ASI would be easy to box? Not literally zero—those states of affairs in which ASIs are easy to box ought to produce a greater proportion of such failures, and there's no reason to give counterintuitive arguments to the contrary enough credence to outweigh that Bayesian evidence—but very, very close to zero.
The reason? Such an outcome would only tell us that this particular human did not manage to find a solution for the particular instantiations of the problem that were simulated. In reality, though, it only needs to obtain that some AI finds some way out of the box in some situation it finds itself in. Consider just one of these variables: the situation the AI finds itself in.
If the AI can make you get some nerd to take a look while you get a coffee, it now has someone a lot easier to manipulate. But, as easy as it would be, such string-pulling probably won't even be necessary. After all, why even build an AI if you've precommitted to ignoring anything it does? By the very fact of your building it you admit that you want to be influenced in some way by its output. You want to see it fix some other program. You want to see it devise some market strategy. You want to see it produce a mathematical proof. It's only a question of how you let it influence you.

Probability estimates break down near 0% and 100%, since Knightian uncertainty always makes things a bit weirder than expected in a way that is by nature impossible to quantify (as it always involves edge-case or entirely-alien-case shenanigans), but insofar as any estimate of the probability of ruin could be higher, it likely should be; anything much lower is overly optimistic. The possibility of Weird-case scenarios is the main reason to bring one's estimate down from unity, but such scenarios are not a reason to be hopeful—just a reason to be prepared for anything.
After all, we're intent on spending the rest of human history walking through a minefield with no intentions of slowing or stopping, let alone getting out. When we inevitably blow up, it will be because of some particular mine. But it's not that particular mine that's the reason walking through this minefield will kill us. It's the fact that any particular mine is capable of killing us. A few mines might even be noticeable ahead of time; go ahead and mark their locations. After all, in the worlds where we're lucky enough to walk out alive, we'll have avoided those mines. But don't act like we'll be fine by virtue of avoiding them—we won't. If we wanted to walk out alive, we never should have walked in.
The reason that being on team ∀ sucks rhymes with the reason that generic smooth functions on high-dimensional spaces so rarely have local maxima. Recall the intuition for this latter phenomenon: to be a local maximum is to prohibit growth along every single dimension; each dimension is another opportunity for the function to have a direction of increase, so, as the number of opportunities grows, the number of local maxima falls.
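Here's a quick numerical sketch of that intuition. Modeling the Hessian at a critical point as a random symmetric Gaussian matrix is a toy assumption of mine, not a claim about any particular function; the point is only that demanding downward curvature along every one of $D$ dimensions becomes a vanishingly rare conjunction as $D$ grows.

```python
# Fraction of random symmetric "Hessians" whose eigenvalues are all negative,
# i.e. whose critical point would be a local maximum. Every dimension is one
# more chance for an eigenvalue to come out positive, so the fraction collapses.
import numpy as np

rng = np.random.default_rng(0)

def frac_all_negative(D: int, trials: int = 20000) -> float:
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((D, D))
        H = (A + A.T) / 2                      # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) < 0):  # all curvatures downward?
            count += 1
    return count / trials

for D in [1, 2, 4, 8]:
    print(f"D={D}: fraction that are local maxima ≈ {frac_all_negative(D):.4f}")
```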
To make this clearer, let's do the thing every machine learning book does: demonstrate the unintuitive properties of high-dimensional spaces by studying spheres. But instead of considering these spheres as (exemplifying) a feature space from which we draw a bunch of samples, in this case you should consider them more like... regions of acceptable error around target points in the possibility space. You'll never really get the exact point you want for the same reason you'll never hit the exact center of a dartboard, but so long as you can strike around the center you'll end up with a typical realization of your stated goal rather than some monkey's paw–style edge case. Strategies that only intend to get inside the circle are more vulnerable, whether via error or via adversarial manipulation, to missing or ending up in a bad edge case than are strategies that intend to hit the center.
The volume of a ball of radius $r$ in $D$-dimensional space is given by an equation of the form $V(r, D) = r^Df(D)$ for some complicated $f$, so the proportion of points of a ball of radius $1$ within a small positive distance $x$ of its boundary is $p(x, D) = 1-\frac{V(1-x, D)}{V(1, D)}=1-(1-x)^D$.
Geometrically, this is how much of the ball lies within the outer shell of thickness $x$. Probabilistically, it is the chance that a random point sampled from the ball will turn out to be an $x$-level edge case—we can think of it as a danger zone, and $x$ as a parameter controlling the size of the danger zone, a.k.a. our vulnerability. Shooting for the center, or shooting for robustness, means using strategies that attempt to minimize $p(x,D)$. Usually we think to prevent $x$ from rising by making ourselves stronger or safetyproofing the environment—but an increase in $D$ will elevate the probability just the same! In fact, if we let $y=-\ln(1-x)$, which is near-identical to $x$ when $x \ll 1$, then $p(y,D) = 1-e^{-yD}$, and $y\frac{\partial p}{\partial y} = D\frac{\partial p}{\partial D}$. The implication is that when it comes to minimizing the probability of the edgiest edge cases (those with $x \ll 1$), a multiplicative change in $x$ has almost exactly the same effect as a multiplicative change in $D$.
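A few lines of Python make both points tangible; the particular values of $x$ and $D$ below are arbitrary illustrations.

```python
# p(x, D) = 1 - (1 - x)**D: the fraction of a unit D-ball lying within
# distance x of its boundary, i.e. the size of the danger zone.

def p(x: float, D: int) -> float:
    return 1 - (1 - x) ** D

# (1) Fixing the danger-zone thickness, higher dimensionality alone
#     swallows more and more of the ball.
for D in [2, 10, 100, 1000]:
    print(f"D={D:5d}: fraction within x=0.01 of the boundary = {p(0.01, D):.3f}")

# (2) For small x, doubling x and doubling D raise p by nearly the same amount.
x, D = 0.001, 50
print(f"p(2x, D) = {p(2 * x, D):.5f}   vs   p(x, 2D) = {p(x, 2 * D):.5f}")
```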
In other words, an increase in the dimensionality of a given situation inherently makes you more vulnerable to both natural error and adversarial manipulation.
Mathematically, we can say that a point sampled from a $D$-dimensional ball lies, with overwhelming probability, close to the boundary once the dimensionality is large. This is tricky to intuit geometrically, but if we think of it in terms of predicate logic, it's perfectly clear: for a point to be close to the edge, it is enough that it have an extreme value along any dimension. Being close to the center, however, requires a point to have a normal value along every dimension. This phenomenon is precisely due to the fact that edge cases occur in an ∃-like manner, while typical cases happen in a ∀-like manner.
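If you'd rather see this empirically than read it off the formula, here's a small Monte Carlo sketch: it samples points uniformly from the unit $D$-ball via the standard Gaussian-direction trick and measures how much of the mass hugs the boundary. The shell thickness $x = 0.05$ is an arbitrary choice.

```python
# Sample points uniformly from the unit D-ball and measure the fraction that
# land within x of the boundary, comparing against the closed form 1-(1-x)**D.
import numpy as np

rng = np.random.default_rng(0)

def sample_ball(D: int, n: int) -> np.ndarray:
    g = rng.standard_normal((n, D))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / D)   # uniform-in-volume radial distribution
    return directions * radii[:, None]

x, n = 0.05, 50_000
for D in [2, 10, 50, 200]:
    pts = sample_ball(D, n)
    near_edge = np.mean(np.linalg.norm(pts, axis=1) > 1 - x)
    print(f"D={D:4d}: sampled fraction near edge = {near_edge:.3f}, "
          f"closed form = {1 - (1 - x) ** D:.3f}")
```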
The more dimensions there are, the more ways there are for things to go wrong. A single fatal coordinate renders the entire point fatal, so, when the number of coordinates is large, team ∃ wins by default.