(Published May 15, 2023 by Mariven)
Here, I'll give you a formalism for studying the logical structure of goals, and, using it to characterize the space of possibilities through which AI ruin may end up obtaining, demonstrate what exactly it is about the alignment problem that prevents us from solving it.
A/N: This was written for a popular audience, so it lacks the branching writing structures I usually use to explain my reasoning; as such, it's more evocative than explicative compared to my other writing. The main ideas presented herein form one part of a much more comprehensive framework for thinking about optimization in general, to be introduced in an as-yet-unfinished essay. That essay, Worldspace, will treat of the logical, physical, and computational structure of what I call here the "possibility space", demonstrating how this structure can be used to understand and predict the convergent behaviors of superintelligences.
In propositional (zeroth-order) logic, we work with propositions, or statements that are unambiguously either true or false. In predicate (first-order) logic, we work with abstract properties that, when applied to some particular object $x$, yield propositions whose truth value turns on whether $x$ has the property. If $P$ represents the property of being mortal—we can call it the mortality predicate—then the proposition $P(x)$ is true when $x$ refers to a living thing that happens to be mortal, and false otherwise.
Unlike a proposition, a predicate does not have a truth value in itself; nevertheless, we can wring one out of it by finding some way to quantify the way in which it holds across all $x$ of the correct type—in this case, $x$ can vary across the set $X$ of all living things. We call this method of turning predicates into propositions quantification. There are two common ways of quantifying the manner in which $P$ holds across $X$: universal quantification, written $(\forall x \in X)\,P(x)$, which asserts that $P$ holds for every $x$ in $X$, and existential quantification, written $(\exists x \in X)\,P(x)$, which asserts that $P$ holds for at least one $x$ in $X$.
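To make this concrete, here's a minimal sketch in Python, using the built-in all() and any() as finite stand-ins for ∀ and ∃; the domain and the mortality predicate below are toy inventions of mine, there only to show how a predicate plus a quantifier yields a single proposition.

```python
# Quantification over a finite domain: all() plays the role of ∀, any() of ∃.
# The domain X and the predicate are illustrative toys, not anything canonical.

X = ["Socrates", "a housefly", "a bristlecone pine", "a freshwater hydra"]

def is_mortal(x: str) -> bool:
    # The predicate P: maps each object of the domain to True or False.
    # (Hydras are close to biologically non-senescent, so we flag them.)
    return "hydra" not in x

# Universal quantification: the single proposition (∀ x ∈ X) P(x).
forall_mortal = all(is_mortal(x) for x in X)   # False: one counterexample suffices

# Existential quantification: the single proposition (∃ x ∈ X) P(x).
exists_mortal = any(is_mortal(x) for x in X)   # True: one witness suffices

print(forall_mortal, exists_mortal)
```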
Let's talk about when and how this general notion of quantification—of making a generalized predicate into a concrete proposition by talking about what can be said about its space of possible instantiations—can be put to use in our thinking about the real world. To this end, let's characterize the differences between real-world thinking and mathematical thinking, chief among them being that real-world thinking is always carried out in the service of some end.
An example of the relevance of ends: if our goal is to rid a house of a termite infestation, we want to achieve this goal by killing as many termites as we can. This isn't literally necessary—you could kill just enough of them that the colony can no longer reproduce—but doing the bare minimum isn't robust, since just a small missed patch or exogenous source allows them to grow back. So, even though it's not feasible to kill every single termite, the success condition is, for practical purposes, much closer in form to the extermination of all termites than to the extermination of some. Our goal isn't literally "(∀ termite)(dead)", but it behaves like a slightly weaker form of it.
If on the other hand we're on team termite, trying to survive the extermination, it is neither necessary nor feasible for every single termite to survive. We want some to survive, preferably enough for the colony to grow back. Our goal on this side isn't literally "(∃ termite)(not dead)", but it behaves like a slightly stronger form of it.
We can say that the exterminator has a ∀-like goal, while the termites have an ∃-like goal. The two goals are clearly opposites: for the exterminator to win is for the termites to lose, and vice versa.
That ∀-like goals should be opposed to ∃-like goals in general is due to the duality of existential and universal quantification. Logically speaking, the negation of (∀ object)(predicate) is (∃ object)(negation of predicate): for it to be false that a predicate is true for every object in its range is for it to be true that the predicate is false for some object in its range.
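On a finite domain this duality can be checked mechanically; the short Python sketch below (over an arbitrary three-element toy domain) just verifies that negating all() always coincides with any() of the negation.

```python
# Check ¬(∀x)P(x) == (∃x)¬P(x) for every possible predicate on a small domain.
# Each predicate on a 3-element domain is just an assignment of True/False
# to its elements, so we can enumerate all eight of them exhaustively.
import itertools

def duality_holds(domain, predicate) -> bool:
    return (not all(predicate(x) for x in domain)) == any(not predicate(x) for x in domain)

domain = [0, 1, 2]
for bits in itertools.product([False, True], repeat=len(domain)):
    predicate = lambda x, bits=bits: bits[x]
    assert duality_holds(domain, predicate)

print("¬∀ = ∃¬ holds for every predicate over this domain")
```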
The manner in which a goal is quantified inherently depends on the structure of the space of possibilities that the actor with the goal admits. If you're really serious about removing every termite, firebomb the house. Or drop several firebombs—then you only have to hope that one hits, while the termites have to hope that all of them miss. But we generally want the house to remain undamaged and usable shortly after the termites are gone; it is this tying of our hands behind our backs, this restricting of the possibility space, that limits us to the search-and-destroy family of strategies with respect to which the extermination problem has the quantificatory structure above.
Now, the structure of your possibility space is generally invisible to you. By considering weird thought experiments you can shine a light on some of its limitations, but the full structure of the space only exists implicitly, not making itself visible except in those small spots you end up actually thinking about. You can't see what you can't see, and you can't see that you can't see it. Try your hardest to "think outside of the box", and you'll only end up constructing a box and then exploring the plane on which that box resides before settling down with a false confidence, for your picture of the possibility space isn't limited by the breadth of its span so much as by its dimensionality. Truly unexpected possibilities are almost never extreme versions of obvious possibilities; they're entirely new things that shock and confuse you because you had never thought anything like them, let alone thought them impossible. If you don't instinctively understand this, you may end up coming to believe such silly things as that a superintelligence would not be able to kill literally every human, because all it takes is a couple hidden in a bunker somewhere to keep the human race going—in the development of your beliefs nothing will shout at you that "my thoughts do not leave the borders of my map", and you will therefore never think that there might be ways to exterminate humanity that just don't have such failure modes.
To build on this with a more advanced version of the termite example, let's consider a group of humans trying to use human-built technology to cause human extinction. How are we to quantify this goal—are they on team ∀, the team that needs everything to go right, or on team ∃, the team that only needs one thing to go right?
Well, consider the obvious options: gaining control of all nukes, creating and spreading bioweapons, triggering environmental superdisasters, and so on. All of these have the same kind of failure mode: some weird off-the-radar religious commune somewhere in the Yukon where the mountain winds keep away fallout through some density-dependent particulate nucleating condensation blah blah blah. A handful of survivors anywhere, and the goal is missed.
So, with respect to the space of possibilities our technology makes available, team extinction is team ∀, which gives it one hell of an intrinsic disadvantage. The way to shift teams is to change the possibility space, which can be done through, among other things, technological advance.
Team extinction, then, wants to develop technology that by its nature ends up systematically scouring the Earth of humans. I don't mean to spread infohazardous information, but there are actually some really plausible paths to such technologies—you can find some here!
How is this relevant to the problem of AI-¬(∀ person)(dead)-ism? An infracognitive disposition might lead you to believe that the relevant quantification is in the name—since ¬(∀ person)(dead) = (∃ person)(¬dead), we must have an ∃-like goal, namely the preservation of at least some small part of humanity—but, again, it's all in the structure of the possibility space. A future in which the vast majority of humans are wiped out by AI in a single event is almost certainly a future in which the rest die shortly afterwards.
Our team has a ∀-like goal: to prevent any future AI from finding any approach to some goal which approach results in ruin. Team Machine has an ∃-like goal: for some future AI to find some approach to any goal which results in ruin. (Do note that these 'teams' are not groups of people but abstractions of movement patterns through the possibility space; we're only "on" one team or the other insofar as we work towards one abstraction or the other).
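Spelled out in the quantifier notation used above, with "AI", "goal", and "approach" as informal variable names of my own choosing, the two goals look like this:

$$(\forall\,\text{AI})(\forall\,\text{goal})(\forall\,\text{approach})\ \neg\text{ruin} \qquad\text{versus}\qquad (\exists\,\text{AI})(\exists\,\text{goal})(\exists\,\text{approach})\ \text{ruin},$$

the second being the literal negation of the first, per the duality of ∀ and ∃ noted earlier.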
The ∀-likeness and ∃-likeness of goals come in degrees, as do the opportunities for reorienting the possibility space so as to change the quantifier which applies to a goal; making these degrees precise would take a proper formalism for the structure of possibility spaces. Such a formalism is a prerequisite to dissecting the abstract nature of instrumentality, to thinking fruitfully about the behavior of superintelligences, to the systematic construction and control of arbitrary optimizing systems. I've been developing one by co-opting concepts from statistical mechanics and differential geometry, but it'll be in pre-alpha for a while given my lack of intellectual or financial support. So for now, we'll have to approximate these as two ends of a continuum of possible orientations in the possibility space, where an actor with a ∀-like goal has to block off huge portions of the space while one with an ∃-like goal just needs to find a single weak point, a tiny opening to pass through—like a soccer goalie and a kicker, respectively.
Except instead of 1 kicker every couple minutes, it's 1 feeble one at first, then 5 weak ones a minute later, 30 mediocre ones thirty seconds later, 120 competent ones fifteen seconds later, 600 strong ones eight seconds later, 3000 inhuman ones four seconds later, and so on—and any single goal likely means game over.
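To get a feel for just how lopsided this is even before the kickers start improving, here's a back-of-the-envelope sketch in Python; the 99% per-shot block rate is a number I made up for illustration.

```python
# If each shot is blocked independently with probability b, the goalie's
# chance of blocking *all* n shots is b**n, while the kickers only need
# a single shot to get through. The block rate and shot counts are toys.

def p_all_blocked(block_prob: float, n_shots: int) -> float:
    return block_prob ** n_shots

for n in [1, 5, 30, 120, 600, 3000]:
    print(f"{n:5d} shots at a 99% block rate each: "
          f"P(nothing gets through) = {p_all_blocked(0.99, n):.3g}")
```

Even a goalie who blocks 99% of individual shots is all but guaranteed to concede once the shots number in the thousands.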
In laying this out, I'll run along many of the same lines as AGI Ruin Scenarios are Likely and Disjunctive and List of Lethalities. In fact, what I call the ∃-likeness of ruin is what the former calls its disjunctiveness. I'm not saying anything new. I'm not even saying anything non-obvious. But to see this as obvious requires looking at what the goal actually is in the context of how the world actually tends to work.
In any case, these are only logical structures. It takes object-level considerations about AI to form an object-level world model. Here, I'll articulate what seem to me to be the two main reasons that artificial superintelligence will almost certainly lead to ruin.
Just don't forget that the object-level details are controlled by the logical structure of our goal: I can name problems #7, #16, and #525, and someone may even come up with feasible solutions to each of them, but we're on team ∀, not team ∃, meaning that the logical structure of our goal makes relevant countless other little details which we just don't know about. The result of fixing everything we can see is that we'll be killed by the things we can't see.
We don't, and can't, know which goal $\times$ mind $\times$ environment tuples will lead to destruction: to be able to reliably tell would require us to be able to determine the way in which the mind would pursue the goal in the environment, a task which would require the level of intelligence of the mind itself. Minds with goals are things which look at the environment and determine how to alter it, and, for all but the dumbest minds, the manner of altering it which the mind ends up choosing depends very sharply on the affordances provided by the world. So sharply, in fact, that approximation becomes impossible very quickly.
If the AI behaves as expected in your sandbox, it might just be because the sandbox eliminates all the little details that make the real world so weird. Taking it out of the sandbox, it'll find these details—you'd call them "side-channel attacks" and "security vulnerabilities" as though they were cheap tricks rather than simple features of reality—and quickly exploit all the cool new optimization strategies they offer. When the AI sees the same two-dimensional fence that successfully contained it in your two-dimensional sandbox, and simply steps over it via any one of the countless microscopic dimensions that the real world offers, you won't even have time to say "wait, that's not fair".
This greatly constrains the space of strategies we can use, since it means we have to take precautions rather than seeing and reacting to what comes up. We can't just wait and see what an AI will do and politely ask its creators to shut it off, since once we see a potentially threatening artificial superintelligence, it's probably too late. Those entities that try to mitigate state security threats and zero-day exploits—other cases of team ∀ actors—can often respond to newly revealed threats, but if we detect a superintelligence operating, we can only assume that it has already concluded we wouldn't be able to do anything about it.
* (I'm not sure you can ever really guarantee that your successor has no blind spots, cf. Reflections on Trusting Trust, but (a) if any intelligence is gonna develop a sneaky way around this, it'll be a superintelligence, (b) even if it can't be guaranteed, it's a hell of a lot more likely that such blind spots will be eliminated, and (c) building a successor is basically obligatory for other reasons, so you may as well try to eliminate blind spots in the process).
Being up against team ∃ doesn't always rule out perfect precaution: if you can upper-bound the set of threats that might obtain, you just have to make sure each of them is covered. If there were only five computer chips on the planet that could be used to build artificial superintelligences, it would be a lot easier to simply track them down and keep them from being misused (e.g. by destroying them, buying out their creators, erasing all records of their existence). But for team ∀ in the world as it actually is, perfect precaution is just not going to happen.
Of course, in this specific case, the sheer vastness of the possibility space isn't even the greatest problem with perfect precaution. If humanity were coordinated and alignmentpilled, we'd have some chance. It is neither.
As the superintelligence waterline lowers—as compute becomes more powerful; as algorithms become more intelligent, more efficient with what compute they have, and more capable of utilizing the exponentially increasing amounts of it; and as more money and attention get poured into the task of building it (with maybe an $O(\sqrt x)$ amount of that going into alignment research, the vast majority of which will be done by morons and conmen who neither know nor care what the actual problems are but convincingly assure the world they have the solution)—there will be more and more powerful entities using more and more force, and, somehow, an even greater force has to be put in place to stop them. It won't happen. Even if it did, it wouldn't happen in any actually useful way. Consider that if it transpires that there even is a rogue data center that needs to be destroyed, air strikes would be too little, too late. We could optimistically hope that they'd work, whether as threat or reality, but they won't—an even vaguely competent adversary would not localize their data centers, would find ways to cover up their footprint, would acquire protection from any major nation willing to covertly defect.
I'll illustrate this point with the hypothetical boxed AI, whose goal is to manipulate a human to remove the airgap keeping it from the outside world. We assume that all rowhammer-style attacks have magically been rendered impossible, so that it can only escape by manipulating the human operator.
As Yudkowsky notoriously demonstrated, such an AI can escape, even when it's just a human playing pretend. But suppose instead that he had horrifically, embarrassingly failed every single one of his trials. How much evidence would this be for the hypothesis that an ASI would be easy to box? Not literally zero—those states of affairs in which ASIs are easy to box ought to produce a greater proportion of such failures, and there's no reason to give counterintuitive arguments to the contrary enough credence to outweigh that Bayesian evidence—but very, very close to zero.
The reason? Such an outcome would only tell us that this particular human did not manage to find a solution for the particular instantiations of the problem that were simulated. In reality, though, it only needs to obtain that some AI finds some way out of the box in some situation it finds itself in. Consider just one of these variables: the situation the AI finds itself in.
If the AI can make you get some nerd to take a look while you get a coffee, it now has someone a lot easier to manipulate. But, as easy as it would be, such string-pulling probably won't even be necessary. After all, why even build an AI if you've precommitted to ignoring anything it does? By the very fact of your building it you admit that you want to be influenced in some way by its output. You want to see it fix some other program. You want to see it devise some market strategy. You want to see it produce a mathematical proof. It's only a question of how you let it influence you.

Probability estimates break down near 0% and 100%, since Knightian uncertainty always makes things a bit weirder than expected in a way that is by nature impossible to quantify (as it always involves edge-case or entirely-alien-case shenanigans), but insofar as any estimate of the probability of ruin could be higher, it likely should be; anything much lower is overly optimistic. The possibility of Weird-case scenarios is the main reason to bring one's estimate down from unity, but such scenarios are not a reason to be hopeful—just a reason to be prepared for anything.
After all, we're intent on spending the rest of human history walking through a minefield with no intentions of slowing or stopping, let alone getting out. When we inevitably blow up, it will be because of some particular mine. But it's not that particular mine that's the reason walking through this minefield will kill us. It's the fact that any particular mine is capable of killing us. A few mines might even be noticeable ahead of time; go ahead and mark their locations. After all, in the worlds where we're lucky enough to walk out alive, we'll have avoided those mines. But don't act like we'll be fine by virtue of avoiding them—we won't. If we wanted to walk out alive, we never should have walked in.
The reason that being on team ∀ sucks rhymes with the reason that generic smooth functions on high-dimensional spaces so rarely have local maxima. Recall the intuition for this latter phenomenon: to be a local maximum is to prohibit growth along every single dimension; each dimension is another opportunity for the function to have a direction of increase, so, as the number of opportunities grows, the number of local maxima falls.
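Here's a quick numerical sketch of that intuition. Modeling the Hessian at a critical point as a random symmetric Gaussian matrix is a toy assumption of mine, not a claim about any particular function; the point is only that demanding downward curvature along every one of $D$ dimensions becomes a vanishingly rare conjunction as $D$ grows.

```python
# Fraction of random symmetric "Hessians" whose eigenvalues are all negative,
# i.e. whose critical point would be a local maximum. Every dimension is one
# more chance for an eigenvalue to come out positive, so the fraction collapses.
import numpy as np

rng = np.random.default_rng(0)

def frac_all_negative(D: int, trials: int = 20000) -> float:
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((D, D))
        H = (A + A.T) / 2                      # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) < 0):  # all curvatures downward?
            count += 1
    return count / trials

for D in [1, 2, 4, 8]:
    print(f"D={D}: fraction that are local maxima ≈ {frac_all_negative(D):.4f}")
```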
To make this clearer, let's do the thing every machine learning book does: demonstrate the unintuitive properties of high-dimensional spaces by studying spheres. But instead of considering these spheres as (exemplifying) a feature space from which we draw a bunch of samples, in this case you should consider them more like... regions of acceptable error around target points in the possibility space. You'll never really get the exact point you want for the same reason you'll never hit the exact center of a dartboard, but so long as you can strike around the center you'll end up with a typical realization of your stated goal rather than some monkey's paw–style edge case. Strategies that only intend to get inside the circle are more vulnerable, whether via error or via adversarial manipulation, to missing or ending up in a bad edge case than are strategies that intend to hit the center.
The volume of a ball of radius $r$ in $D$-dimensional space is given by an equation of the form $V(r, D) = r^Df(D)$ for some complicated $f$, so the proportion of points of a ball of radius $1$ within a small positive distance $x$ of its boundary is $p(x, D) = 1-\frac{V(1-x, D)}{V(1, D)}=1-(1-x)^D$.
Geometrically, this is how much of the ball lies within the outer shell of thickness $x$. Probabilistically, it is the chance that a random point sampled from the ball will turn out to be an $x$-level edge case—we can think of it as a danger zone, and $x$ as a parameter controlling the size of the danger zone, a.k.a. our vulnerability. Shooting for the center, or shooting for robustness, means using strategies that attempt to minimize $p(x,D)$. Usually we think to prevent $x$ from rising by making ourselves stronger or safetyproofing the environment—but an increase in $D$ will elevate the probability just the same! In fact, if we let $y=-\ln(1-x)$, which is near-identical to $x$ when $x \ll 1$, then $p(y,D) = 1-e^{-yD}$, and $y\frac{\partial p}{\partial y} = D\frac{\partial p}{\partial D}$. The implication is that when it comes to minimizing the probability of the edgiest edge cases (those with $x \ll 1$), a multiplicative change in $x$ has almost exactly the same effect as a multiplicative change in $D$.
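A few lines of Python make both points tangible; the particular values of $x$ and $D$ below are arbitrary illustrations.

```python
# p(x, D) = 1 - (1 - x)**D: the fraction of a unit D-ball lying within
# distance x of its boundary, i.e. the size of the danger zone.

def p(x: float, D: int) -> float:
    return 1 - (1 - x) ** D

# (1) Fixing the danger-zone thickness, higher dimensionality alone
#     swallows more and more of the ball.
for D in [2, 10, 100, 1000]:
    print(f"D={D:5d}: fraction within x=0.01 of the boundary = {p(0.01, D):.3f}")

# (2) For small x, doubling x and doubling D raise p by nearly the same amount.
x, D = 0.001, 50
print(f"p(2x, D) = {p(2 * x, D):.5f}   vs   p(x, 2D) = {p(x, 2 * D):.5f}")
```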
In other words, an increase in the dimensionality of a given situation inherently makes you more vulnerable to both natural error and adversarial manipulation.
Mathematically, we can say that a point sampled from a $D$-dimensional ball lies, with overwhelming probability, close to the boundary once the dimensionality is large. This is tricky to intuit geometrically, but if we think of it in terms of predicate logic, it's perfectly clear: for a point to be close to the edge, it is enough that it have an extreme value along any dimension. Being close to the center, however, requires a point to have a normal value along every dimension. This phenomenon is precisely due to the fact that edge cases occur in an ∃-like manner, while typical cases happen in a ∀-like manner.
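If you'd rather see this empirically than read it off the formula, here's a small Monte Carlo sketch: it samples points uniformly from the unit $D$-ball via the standard Gaussian-direction trick and measures how much of the mass hugs the boundary. The shell thickness $x = 0.05$ is an arbitrary choice.

```python
# Sample points uniformly from the unit D-ball and measure the fraction that
# land within x of the boundary, comparing against the closed form 1-(1-x)**D.
import numpy as np

rng = np.random.default_rng(0)

def sample_ball(D: int, n: int) -> np.ndarray:
    g = rng.standard_normal((n, D))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / D)   # uniform-in-volume radial distribution
    return directions * radii[:, None]

x, n = 0.05, 50_000
for D in [2, 10, 50, 200]:
    pts = sample_ball(D, n)
    near_edge = np.mean(np.linalg.norm(pts, axis=1) > 1 - x)
    print(f"D={D:4d}: sampled fraction near edge = {near_edge:.3f}, "
          f"closed form = {1 - (1 - x) ** D:.3f}")
```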
The more dimensions there are, the more ways there are for things to go wrong. A single fatal coordinate renders the entire point fatal, so, when the number of coordinates is large, team ∃ wins by default.