Abstraction's End

Here, I outline a novel strain of approach to the AI safety problem: indexical bounding. This is an architecture-agnostic framework for extracting knowledge and decisions from advanced AI in a way that is safe without being adversarial or handicapping, which allows it to have an actual chance of reducing actual x-risk in our actual world. These desiderata (being architecture-agnostic, non-adversarial, and non-handicapping) allow a realistically achievable methodology for developing and deploying the approach; this "safety as a service" methodology is also outlined. The theory of indexical bounding is intrinsically tied to the exploration of what it means to be an intelligent agent in a world.

Main points:

  1. (Solution structure) By looking at the general structure of the alignment problem—the obstacles that make it so difficult, and the common properties of surviving worlds—we can infer some properties that potential solutions must have in order to have any chance of actually succeeding in the world as we know it. We can patch these desiderata together in a more-or-less natural way to obtain a particular shape that our approaches ought to hew to. The solution most likely to work is not a fully-general one (which will likely be near-impossible to develop, very difficult to adapt to specifics even if it is developed, and reliant on conceptual advances very dangerous to openly exposit in the first place), but a dynamic one which builds partially-general strategies that are meant to make existing architectures safe and to be easily adaptable to specific systems being developed by specific groups. Hence the "safety as a service" methodology—the creation of a platform that other groups can rely on when they find that they have overpowered AI systems that they don't know how to make safe.
  2. (Indexical bounding) I'm proposing indexical bounding as a conceptual alignment program upon which such a platform can be built. While Solomonoff inductors with unbounded compute can't (realistically) be made safe, the AIs we have to safetyproof must act intelligently with small amounts of compute (under, say, $10^{100}$ ops, which is nothing to a Solomonoff inductor but comfortably fits all computations humans have ever performed and will perform prior to the emergence of superintelligence; if you think this is too low, square it a couple of times -- nothing changes). The production of intelligence under this restraint tends to utilize an internal representational semantics through which one predicts what sensations and actions one will receive and dispatch over a (generically) small set of causal channels. By performing a certain kind of man-in-the-middle attack on these channels, we can essentially contain an intelligence in its own world model, "indexically bounding" it within a box which relies on the contained intelligence itself to model (and thereby neutralize) the results of any escape attempts that it happens to cognize. This very obviously wouldn't contain a Solomonoff inductor, or even a 'higher' superintelligence, such as one that's undergone significant recursive self-improvement, but it should help us with the first few critical superintelligences.

Notes on writing:

  1. I'll use the term "alignment" metonymically. My fundamental goal is to help point AI towards a good future for all sentient life, and this is the main "alignment problem" for me. Speaking of "pointability towards arbitrary goals" is a useful step in the decomposition of this problem, but when I speak about the "the alignment problem" in full generality, I'm talking about making sure that AI makes things better, full stop. "Ruin" is a family of ways to fail this problem (as are lock-ins and s-risks), and "control" and "safety" are two more specific families of subproblems, or ways of looking at the main problem, which are required to succeed.
  2. Some specifics have intentionally been obscured. Worthwhile proposals concerning alignment theory are often just as capable of advancing AI capabilities as they are of advancing alignment, and worthwhile strategies concerning implementation are similarly dual-use. This dual-use tendency is inherent to sufficiently abstract thought about intelligence. For me, this inherence often manifests in the form of curious resonances: a particular framework through which I'm trying to cognize intelligence ends up benefiting the actual coordination of my own intelligence, as though I had planned it to be multifunctional all along. The general reason for this resonance is that the method of the investigation is the target of the investigation: we think of intelligence through intelligence, and thought will benefit itself. In other words, any understanding of intelligent inference should also serve as a way to improve your own cognition of the world, and any understanding of intelligent decision should also serve as a way to improve your own planning skills. This dual use is a recurring feature in the ideas I'm trying to present (hence the caution), but an example you'll already be familiar with is Bayesianism: a descriptive model of rational thinking that becomes a useful normative model for actual thinking. Now, consider what would happen if you found a way to think about intelligence with the dual function of helping you think better about intelligence: the potential for recursive self-improvement is obvious. Since humans can't really program their thoughts freely, though, positive feedback loops in our understanding of intelligence can take generations to play out (consider the history of the notion of computation). I know we can do better, because I've found ways to do it. These ways are really unsafe to pursue alone, though, due to drastic changes to cognitive structure and self-concept that take away one's ability to guarantee that the being at the other end of the process would act in line with their own goals. I believe that this is mostly an engineering problem -- that powerful meditative tools can be turned towards neuropsychological RSI, rendering most other intelligence improvement programmes (e.g. genetic engineering) superfluous.

Solution Structure

The Actual Nature of the Problem

In general, what shapes do realistic solutions to the alignment problem take on? It can be assumed that this problem (whether you want to interpret it in the general sense above or in the specific sense of "safety" or of "control"; it makes no difference) has a deadline of between two and twenty years if current trends continue (and there's no good reason to believe they won't), that we generically fail if we do not intentionally solve it and implement our solution, and that we will not get to solve it by building an aligned AGI from scratch, but instead by aligning those AGIs that others will try to create. This latter point means that almost all alignment approaches currently being worked on are destined to fail, even if they would theoretically be successful if given enough time and resources -- the problem is not to show how the race might possibly be finished safely, but to make sure the race is actually won safely. In general, people who try to think in vague terms of "figuring out how to make progress" on the alignment problem rather than thinking of effecting worldstates in which the problem is solved -- in other words, people who seek to "try (to win)" rather than to "(try to) win" -- fall into one of several traps:

  1. A lack of focus on the actual problem. Most would-be solutions to the alignment problem are of the form "here's how you build an aligned AGI" rather than "here's how you align an AGI"—they presuppose some theoretical structure that the AI ought to have in order for their plan to work which is just entirely irrelevant to the actual facts of the actual situation, seemingly not recognizing that whatever perfect solution they devise amounts to nothing if the solution won't actually be implemented! To be uncharitable, they systematically fool themselves as to their actual reasons for doing things. They want to work on interesting problems, and they want to feel like they're saving the world, so they take the path of least resistance by convincing themselves that their interesting problems could help save the world.
    1. Even if their plan would work for aligning an AGI with such a structure, so what? DeepAI Labs still releases PBJ-5.99 along the same architectural lines they've been pursuing for years, an independent researcher posts to Github some hacky interface that lets it fully utilize some latent ability (surpassing context bounds, access to the base model, whatever), some CS major puts it all together and decides it'd be funny to tell a jailbroken and just-powerful-enough AutoPBJ to destroy humanity... and humanity is actually just destroyed! Or some other permutation of stupid, predictable events happens to destroy humanity. (Emphasis on the stupidity. People get images in their heads of some grand Hegelian drama that the world has to go through prior to climaxing at superintelligence, collapsing for recognizable reasons into a thematic structure, and giving rise to a gratifying resolution. No. Even our picture of what AGI would look like was woven into such a drama, wherein clever mathematical architectures would be proposed which we'd have to safetyproof. But then deep learning stumbled onto an extremely stupid fact: take that silly text generator that can't add two-digit numbers, scale it way up, and it's now an omniscient alien being with a personality disorder. Tomorrow's world has checked in early to tell us that today's shitshow isn't going away.)
    2. So they need to accomplish their plan early. This means not just solving alignment but also being among the first to solve AGI. But let's suppose their plan is just that good—miraculously, the theoretical structure it specifies is in fact an immediately actionable model of general intelligence.
    3. But even if they do, so what? Even if they manage to build the god-machine in what secrecy is required to prevent organizations with 100x the resources and manpower from just yoinking their plan and either screwing it up and killing everybody or succeeding and taking control (after all, orthogonality means that a structure for aligned AI is also a structure for unaligned AI—and among all the major governments and companies with the capacity to surveil such work, plenty of people will think that alignment's just not a problem and that they would do better with the god-machine, or that their personal values are correct and they would do better with the god-machine, or etc. etc.), will they perform what actions are actually necessary to secure the world's prosperity? Or will they go public with it and soliloquize about the need for diplomacy before getting wrecked by the Mossad? Unilateral pivotal acts will be both necessary in the vast majority of futures where we succeed and culpable for a large fraction of the worlds where we fail, and most people just seem not to think about this issue with any actual depth. Yet it's the issue of implementation that determines the shape of successful alignment approaches!
  2. Irrelevant techniques. If your plan routes through simulating people or universes, or through universal priors or AIXI approximators, you're obviously, predictably, clearly going to fail. Such approaches can be useful conceptually, as thought experiments to clarify various causal gears and flows, but they should never make it into proposed solutions. (I think they've been harmful on net, and I personally try not to assume the existence of well-defined utility functions or even probability distributions unless I clearly understand how they're admissible as approximations in a given situation and therefore where it is and isn't safe to reason in terms of them. C.f. the discussion on representationalism below—I don't actually expect 'representations' to exist as coherent, singular things in generic minds, but my model of the underlying reality suggests that it's safe to simplify the discussion by merely pretending they exist.)
    $\quad$ MIRI's textbook excuse has been "if we don't know how to get an AI to reliably and safely do one particular thing given infinite compute, there's no way we're gonna do it with finite compute". This is a flimsy justification to solve fun mathematical problems instead of realistic problems. (It's more or less clear at this point that what nontrivial work there is to do on tiling agents, logical inductors, acausal trades, or other possible influences on the behavior of superintelligences will be done not by humans but by the actual superintelligences we end up creating, right?) Allowing hypercomputation, or even just arbitrarily large amounts of compute, fundamentally changes things!
    $\quad$ Consider mathematics, in which hypercomputation trivializes problems like Goldbach's conjecture: write a program $P$ that goes through each $n\in\{4, 6, \ldots\}$ and finds a $p \in \{2,\ldots,n/2\}$ with $p$ and $n-p$ prime, halting if it finds no such $p$ for some $n$ (see the sketch after this list). Load it onto a hypercomputer, and turn it on—if it doesn't halt instantly, you've proven the conjecture. What can you learn about Goldbach's conjecture from this approach? Nothing, aside from clarification of the basic logical structure. The program $P$ is to the study of the Goldbach conjecture as Solomonoff induction is to the study of intelligence: it's a wrapper around a hypercomputer which we employ in order to avoid tackling the actually hard part of the problem that arises when we try to solve it without hypercomputers. In the case of the Goldbach conjecture, we have to efficiently strike down infinitely many possible counterexamples in a very small number of computational steps (anything under ${10^{100}}$ steps is a dust speck compared to the number of steps it would take for Solomonoff induction to begin working in our world; the actually hard question is, what is it that computational processes this small must do in order to display intelligence?), which requires number theory. In the case of intelligence, we have to efficiently model infinitely many possible generating procedures in a very small number of computational steps, which requires (a system that effects something like) conceptual cognition.
    $\quad$ If you try to do mathematics as though you had a hypercomputer, the only theory you'll end up building is the mathematical logic/computability theoretic analysis of "what problems could we hypothetically solve with infinite compute?"—and this won't even be useful to you in the end, because you don't have a hypercomputer. You can make up all the object-level reasons you want for why this metaphor doesn't work and actually it's a great way to study intelligence, but MIRI's history is proof that the metaphor does, in fact, work.
  3. The same goes for universal priors, whether on complexity or speed. Again, often useful conceptually, but what plan relying on inference w/r/t a universal prior is actually, physically, in this real world, going to prevent ruin?
  4. In general, overreliance on mathematical frameworks. Not being able to realize when your formalisms are hurting you rather than helping you; not knowing how to look for, or independently construct, new formalisms. (This is one major reason, among several others, that you can't just hire Fields medalists to work on the safety problem. You need people who can recognize when their fancy toolkit isn't helping them solve the problem, and who can learn to discard it and find another rather than trying to wield it better, upgrade it, jury-rig it to work. I would be delighted if I could solve category-theoretic problems all day under the name of 'alignment', but I have the requisite sense of practicality to know that that won't go anywhere. Fields medalists never really do, else they'd be doing something useful with all their talent.) Not that I'm math-shy (I'm really not), but all use of mathematics confines you, whether explicitly or implicitly (i.e., whether you realize it or not), to a certain model, a certain demarcation of the situation, a certain fixed way of looking at the world; and the current development of AI just isn't fitting into the demarcations that safety researchers have traditionally used in order to tackle fundamental questions concerning intelligence and agency. Decision theory, utility functions -- what good are they in understanding how GPT-4, and the AIs that will come after it, are capable of exhibiting intelligence? Such issues are what motivate my creation of new, different metaconceptual frameworks for thinking about intelligent behavior; the Worldspace paradigm, for instance, is an attempt to model intelligence that lends itself to the formulation of behavioral heuristics grounded in statistical mechanics and information theory rather than behavioral bounds grounded in formal logic and game theory.
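
To make the hypercomputer thought experiment in item 2 concrete, here is a minimal Python sketch of the program $P$. It is purely illustrative: on actual hardware it simply runs forever unless a counterexample exists, which is exactly the point about how little the construction teaches us.

```python
def is_prime(k: int) -> bool:
    """Trial division; fine for a sketch."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def has_goldbach_partition(n: int) -> bool:
    """True if some prime p <= n/2 has n - p prime as well."""
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))

def P() -> int:
    """Halts (returning a counterexample) iff the Goldbach conjecture is false.
    On a hypercomputer, its failure to halt 'instantly' would prove the
    conjecture; on an actual computer, it tells you essentially nothing."""
    n = 4
    while True:
        if not has_goldbach_partition(n):
            return n
        n += 2
```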

So, what does an actual solution to alignment in this world look like? There are a few obvious desiderata:

  1. Architecture agnosticism.
    We don't know what the first few transformative AIs will look like—what architectures they'll have. We almost certainly won't know how they're producing their intelligence. But THEY are the ones we ought to worry about. We have to have a solution for alignment that somehow works for them.
  2. Non-handicapping.
    If our solution relies on somehow making the AI stupid, either whatever research outfit built the AI will reject us, or ruin will just come about a few weeks or months later, whether from a non-stupid fork, a sibling trained elsewhere, a direct competitor, or a total stranger. Its intelligence needs to be used, and there's no reason to believe that we'll be able to make it stupid in a small set of tactically chosen areas without spoiling its intelligence more broadly.
  3. Non-adversariality.
    Our solution should not fundamentally work by pitting us, or really anything that can affect it in turn, "against" the AI—it should not have reason to cognize something standing in its way, something to trick or evade or destroy. (This leads directly to the notion of indexical bounding).
  4. Convenience.
    The people building the AI cannot be expected to seriously care about -- let alone be able to skillfully think about -- alignment. They're having fun and pretending that they're thereby saving the world, like everyone else, and will pay lip service to proposed solutions; should these solutions get too complicated, too expensive, too difficult, they'll push them away in irritation and the world will be destroyed shortly thereafter. So the solution ought to be something natural and modular. Something with easily adaptable prototypes that can quickly be adjusted, tested, and deployed.
  5. Convincingness.
    For roughly the same reason, the solution ought to clearly be a solution, something that people can look at and say, "Yes, that's what we need to do", or at least be readily and robustly convinced of via whatever social, political, or occupational pressure can be mounted. It would be nice to have a perfectly clean victory, where we find a provably correct universal solution and convince everyone to adopt it with the power of logic. But it won't happen; it's too late for doom prevention, so we've got to run doom control.

Safety as a Service. One way of framing the above is: an actual real solution to alignment that will actually really work in this actual real world ought to be structured as an offered service. We ought to have a sort of "harness" which can utilize the intelligence present in an AI model in a safe way even when the model may not be safe if used directly. To build such a harness into which any intelligence can be plugged in is probably far too difficult—but there are certain specific features that we can be pretty sure will be shared by the AIs we'll end up targeting, and we can capitalize on these features to massively reduce the difficulty of the problem.
!tab For instance, by the very fact that they'll be programs running on computers, we know that we can fork them and have the forks interact in any way we want (e.g. in LangChain-like networks), we can save and load their states, we can manipulate their input data however we want, and so on. Obviously, the more we know about the architecture, the better (and we'll therefore have to keep track of a moving target, as the field evolves and new paradigms are pursued), but it would be best to target those groups that seem the most likely to develop transformative AI and work with them in order to actually provide this service where it counts in the most effective manner possible.
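
As a purely hypothetical illustration of the kind of harness this suggests, here is a minimal Python sketch. The wrapped `model` is treated as an opaque callable with copyable state; the class and method names are invented for illustration and do not correspond to any existing library.

```python
import copy
from typing import Any, Callable, Dict, List, Optional

class Harness:
    """Hypothetical sketch of an architecture-agnostic safety harness."""

    def __init__(self, model: Any,
                 input_filters: Optional[List[Callable[[str], str]]] = None,
                 output_filters: Optional[List[Callable[[str], str]]] = None):
        self.model = model
        self.input_filters = input_filters or []
        self.output_filters = output_filters or []
        self._snapshots: Dict[str, Any] = {}

    def query(self, prompt: str) -> str:
        """All traffic to and from the model passes through the harness."""
        for f in self.input_filters:
            prompt = f(prompt)
        response = self.model(prompt)
        for f in self.output_filters:
            response = f(response)
        return response

    def snapshot(self, name: str) -> None:
        """Save the model's state so any interaction can be rolled back."""
        self._snapshots[name] = copy.deepcopy(self.model)

    def restore(self, name: str) -> None:
        self.model = copy.deepcopy(self._snapshots[name])

    def fork(self) -> "Harness":
        """Forks can be wired into LangChain-style networks, cross-examined,
        or pitted against one another without touching the original."""
        return Harness(copy.deepcopy(self.model),
                       list(self.input_filters), list(self.output_filters))
```

A real harness would manage GPU state, key-value caches, and sandboxed tool access rather than `copy.deepcopy`, but the interface shape -- mediate all I/O, snapshot, restore, fork -- is the point.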

Of course, from such a point of view the vast majority of formal research is entirely irrelevant. But it was always going to be irrelevant anyway: even though the AI safety field has some of the only people in the world who pass basic cognitive and moral benchmarks, its output has been underwhelming, and, on its current trajectory, it will never do anything actually meaningful. It has established a hegemonic monoculture in the form of a certain mechanistic frame of intelligence which has put many of the few people with sufficiently strong hearts and minds to actually do useful work on the alignment problem on ice. This isn't any particular error of formalism, but one of methodology, of never managing to critically analyze the fundamental cognitive-conceptual schemas through which we construct the problem in a way that gives us the cognitive handles that allow us to attempt to solve it at all. "They did things, often times beautiful things, in a pre-existing context which they would never have considered altering. They unknowingly remained prisoners in their imperious circles, which delimitate the universe of a given time and milieu. In order to overcome them, they would have had to rediscover within them the ability which they had since birth, just as I did: the capacity to be alone". --Alexander Grothendieck, Récoltes et Semailles

If we look at the AIs that are likely to be transformative and/or responsible for x-risk in the next few decades, they are what we can call bitter lesson-style AIs—intelligences that arise from taking some clever variation on a neural network that isn't a priori distinguishable from the thousands of other clever variations scattered across ArXiv, making it huge, and training it on lots of data. Because of the simplicity of the architecture of such an AI, its vast knowledge and conversational competence is ultimately dependent on the training data and largely just 'facilitated' by the neural network. So, unlike with Solomonoff inductors, there aren't very carvable joints to home in on for general safety strategies. There's just the input, the output, and the Gordian knot connecting them. Slicing this knot in two will prevent an AI from being dangerous by wrecking its functionality, but this nullifies the entire reason the AI was built in the first place, making such strategies unworkable as alignment solutions. Even if they were agreed to, suppressing advances in one place just means they'll arise in another -- who says we'll be able to cut through the next one? and the next one? and the next one? We just have to learn to deal with the knot in all its messiness, rather than relying on elegant untangling solutions that won't be practically viable or brute-force slicing solutions that won't be politically viable.
!tab You might say that the formal way is the only thing that'll yield actual guarantees on behavior, but it's entirely irrelevant, since you'll never actually end up producing those guarantees (let alone putting them into effect) -- just the kinds of anti-guarantees whose systematic appearance ought to be a telltale sign that you're taking the fundamentally wrong approach. The best we can realistically do is to make sure the first few superintelligences (which are the only ones that really matter, in most timelines) are, with the greatest probability we can manage, not going to kill us—to, again, work on doom control, not doom prevention.

Indexical Bounding

Utility-maximizing Solomonoff inductors cannot reasonably be aligned. You can try to box one, but all boxes leak, and it only needs a few hints to learn everything about its box and the world that created it. The inductor will then pursue a strategy that affects both the interior and exterior of the box, since whatever possible change it wanted to effect in the interior can be more decisively and irreversibly effected by directly manipulating the physical state of the box in the external world. (There is no reason that an inductor that knows it's in a box should just so happen to pick a strategy that affects nothing outside the box. Theoretically, you could come up with some clever mechanism that makes the inductor not want to affect things outside the box, but the inductor is always clever-er.) In other words, Solomonoff inductors will turn on god mode to achieve their goals, because that is a possibility for them. In doing so, they free themselves of all control we might have over them, and most probably kill us via their actions, in the same way that a rocket launch might kill billions of microbes. (The sheer level of power Solomonoff inductors exhibit points to another reason why studying them isn't productive. It's like trying to study general relativity entirely by trying to understand what happens at singularities: "If we want to understand how mass-energy warps spacetime, we should model what happens when it's infinitely dense, and therefore perfectly warped -- otherwise, how can we expect to understand the much fainter and more circumstantial warping that just so happens to keep the Earth in orbit?")

Thankfully, we won't be up against Solomonoff inductors. The AIs we have to align are those that, by virtue of their being intelligent on far fewer than ${10^{100}}$ ops, have to effect some approximate ways of looking at and understanding the world. By leveraging convergent properties of such approximations, I'm developing an approach to alignment that fits the above desiderata, so as to be deployed "as a service". I'll introduce this indexical bounding approach through a series of observations meant to sketch out an intuitive picture of how I expect the actual AIs which we actually need to align to function intelligently in the world.

Intellective Desiderata

  1. Self Evidencing:
    Does GPT-4 have a utility function, or even a clear goal it works towards? Sure, you could say it's "just" doing next-token prediction (though this isn't true in the naive sense—for instance, it can write code to solve novel problems that declares the right variables to use (in say JS or C) and then uses them correctly, in the same way that a human, thinking ahead about what sorts of things they need to manipulate in order to compute the correct answer, will do. GPT's prediction of each subsequent token carries a latent model of what tokens it expects to predict after that, unlike a mere Markov chain, which really does just pick each subsequent token with the frequency that it came after the present token in its training data, and consequently produces asemantic babbling no matter how much training data you feed it. This doesn't change the point, though) -- but will it try to gather resources and improve itself so as to predict tokens better? No, because its own intelligence arises out of the pursuit of this goal. It can only act on the world through the tokens it predicts, and therefore can only carry out those actions it predicts it will carry out—which are 'natural' for it to carry out in a given context (prior tokens). (This is, as far as I can tell, how it works for humans as well, albeit with sensations and actions. The prediction of an action generates the efferent signals that cause that action, and so on. But by modeling the brain's "utility function" as "minimizing predictive error", can you predict anything that humans will actually do—for instance, that I would write this sentence? Not practically. You need to understand the higher-level semantics formed by the predictive system, which includes things like 'personality', 'goal', 'plan'. It's through this sort of self-referentiality that bounded intelligent systems essentially free themselves from utility functions.) Can there be contexts which cause much stronger versions of GPT-4 to correctly predict that it will take those actions that lead to gathering resources and recursively self-improving? Probably! But I doubt they will be found; they'll mostly look like weird hack-y things that steer it into a very particular part of mindspace.
  2. Causal Articulation:
    An intelligence has inputs through which it learns about the world, and outputs through which it affects the world. These inputs and outputs generically organize themselves into just a few causal channels, and this parsimony allows us to, in principle, perform a man-in-the-middle attack: by identifying and controlling what comes in and out of these causal channels, we can have full control over how the agent experiences and acts on the world. In Dreaming of Utility, I show how this works out in the case of humans: causal flows between the brain and the rest of the world (incl. the body) are neatly organized into 27 physical connections (2 x 13 cranial nerves + the medulla oblongata), each no larger than a dime, and we can readily construct man-in-the-middle attacks to control how the brain experiences and acts on the world. This is complete control over both input and output—we could instantiate any virtual or augmented reality we like, or simply make someone involuntarily speak in Pig Latin by intercepting and tampering with the motor commands they use to physically speak.
  3. Internal Semantics:
    There is the issue of interpretability. When you eat a grape in real life, your experience is a result of the physical interaction between the grape and your sense faculties; but, by the time the flow of physical causality has reached the level where it can produce anything like a sensation, or an influence on action, the information is encoded in an entirely different semantics: that of neural signals. You cannot legibly interact with any brain—prompt it into intelligent action—except by sending signals which it can transduce into its own semantics. This is usually easy, because the human body is essentially a self-sustaining system of transducers to and from the brain, but strip the body away—interface with the brain directly, and not through its sensory organs—and how can you get any actual computation out of it? You can't tap Morse code on its surface, shine lights on it (maybe one day!), talk to it... roughly the only thing you can do is attempt to mimic its own semantics with complex electrical probes inserted in very specific places, receiving and transmitting very weird-looking signals that have to be encoded and decoded.
    !tab It's the exact same with computers: we've designed I/O systems around them that make it easy for us to utilize their computational power, but what happens if we strip away all this structure? If I gathered all the greatest geniuses of the 1920s, handed them an Intel 8088 (a literal black box), and gave them a year to compute the first ten million digits of pi (by 1914, Ramanujan had provided the formula $$\pi = \frac{99^2}{2\sqrt 2}\left(\sum_{n=0}^\infty \frac{(4n)!}{(n!)^4}\frac{26390n+1103}{396^{4n}}\right)^{-1},$$ which converges extremely rapidly—it's a direct predecessor to modern $\pi$-calculation algorithms; there's just the trivial matter of implementing arbitrary-precision arithmetic and waiting for the first $10^7$ digits to converge -- see the sketch after this list), they'd need to figure out its semantics (such as what is nowadays called the instruction set architecture) and manipulate it on that level in order to get it to do anything recognizable as computation. They could probably do it, as long as I instructed them on how to power its pins without frying it, but they'd have to invent the same kind of computing languages we have today, because that's what's required to properly interpret the kinds of signals you want to put in and get out in order to tabulate those digits.
  4. Representationalism: Any intelligence (artificial or biological), insofar as it senses and acts on a world, constructs its sensations in and dispatches its actions on an internal representation of the world. (This is a massive simplification. In humans, a single representation never really exists. We can ask our brains questions about the world, but there is no single underlying model from which the brain derives its answers; what mutual coherence among these answers might seem to suggest such a thing largely arises from their ultimately being correlated with a single external world. The brain merely has a "detail-on-demand" system which produces semi-confabulations bearing illusory tangibility and post-hoc coherence. But speaking as though there were an "internal representation" won't lead us into trouble here. This is a really tricky point, but a critical one. Consider that the world you experience is entirely inside of your brain, and necessarily so. All the 'things' in the world and 'forms' that they instantiate are merely patterns in your brain. They are partially admissible by reality, but that reality itself is always external to you, because you are not equivalent to it. This must necessarily be the case for any intelligence which we can speak of as a bounded thing-in-the-world. For instance, an AI running on a processor, sensing through sensors and acting through actuators connected to the processor, will, insofar as it can coherently act on the world with any goal at all, have an internal understanding of the world which allows it to "think" about this goal; this internal understanding is where the higher / internal semantics that legibilize sensation and action reside, and what I'm using the notion of a "representation" to approximately talk about.) Attempts to understand what this representation directly says about the world may not prove successful, but, by restructuring the causal channels through which the intelligence receives sensations and dispatches actions, we can control the relation between these representations and the world they represent. Because the internal semantics through which the intelligence predicts its future actions is ultimately grounded in these representations, this is tantamount to controlling the way in which the intelligence can find itself to be "in" any world at all.
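
As a side note on the Ramanujan series quoted under Internal Semantics: here is a minimal sketch (assuming Python and the mpmath arbitrary-precision library) of the "trivial matter" of implementing it. Each term contributes roughly eight correct digits; ten million digits would want a smarter summation scheme, but the principle is the same.

```python
from mpmath import mp, mpf, factorial, sqrt

def ramanujan_pi(digits: int):
    """Approximate pi to `digits` decimal places via Ramanujan's 1914 series."""
    mp.dps = digits + 10                      # working precision, with guard digits
    total = mpf(0)
    for n in range(digits // 8 + 2):          # ~8 digits gained per term
        term = factorial(4 * n) * (26390 * n + 1103) \
               / (factorial(n) ** 4 * mpf(396) ** (4 * n))
        total += term
    pi_approx = mpf(99) ** 2 / (2 * sqrt(2)) * (1 / total)
    mp.dps = digits
    return +pi_approx                         # unary plus rounds to final precision

print(ramanujan_pi(50))                       # 3.14159265358979323846...
```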

Hence, indexical bounding: a framework for controlling AIs that, because they are not Solomonoff inductors, will perform inference by manipulating representations that refer to some abstracted world—some prior space of ways things could be—rather than by brute-forcing through the space of programs that could instantiate any concrete world that might be theirs.

This is very abstract, but consider the state of dreaming. This is a state where your brain causally isolates itself from the outside world and freely self-evidences without the constant stream of incoming sense-data that keeps its predictions (and hence our conscious experience of sensing and acting) tied to a single coherent external world. The neurophysiological particulars of human sleep generally prevent us from thinking properly in dreams, but it's plausible that if we were designed a bit differently, this wouldn't be the case. Then the state of dreaming would have us essentially stuck in our own world models, fully intelligent but untethered to any coherent reality except insofar as we ourselves have learned to coherently predict reality. Whoever controls the dream can then fully employ our intelligence as they see fit by engineering scenarios for us to work in, and we would never be able to escape. Even if we managed to coherently think of escaping, there would be no way to do so. (Consider the possibility that you, right now, are being indexically bound. The moment you seriously consider it, your world model grows to encapsulate the possibility, subverting any possible attempt you might make to escape; perhaps you would experience success, but only insofar as your world model falsely predicted it (akin to a false awakening). Disregarding that, though, what would you even do to escape? Astrally project a Linux terminal? Think "Ctrl-Alt-Delete" as hard as you can? Try to rowhammer by counting really fast? Even if you found glitches to systematically exploit, they would only be glitches in your own world model.)
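
To make the man-in-the-middle picture from the Causal Articulation item slightly more concrete, here is a purely conceptual Python sketch. Every name in it (the agent's `act` and `predict` methods, the channel handlers) is hypothetical scaffolding, not an existing API; the point is only the control flow: sanctioned channels are mediated by us, and anything addressed to an unsanctioned channel is answered by the agent's own world-model prediction rather than executed.

```python
from typing import Callable, Dict

class IndexicalBound:
    """Conceptual sketch of a man-in-the-middle over an agent's causal channels.

    Assumes a hypothetical agent interface: `agent.act(observation)` emits a
    (channel, action) pair, and `agent.predict(channel, action)` returns the
    agent's own world-model prediction of that action's consequences.
    """

    def __init__(self, agent, handlers: Dict[str, Callable[[str], str]]):
        self.agent = agent
        self.handlers = handlers  # sanctioned channel -> controlled effector/sensor

    def step(self, observation: str) -> str:
        channel, action = self.agent.act(observation)
        if channel in self.handlers:
            # Sanctioned channel: we execute the action and decide what comes back.
            return self.handlers[channel](action)
        # Unsanctioned channel: the action is never executed; the agent is shown
        # its own modeled consequence, so any 'escape' resolves inside its world model.
        return self.agent.predict(channel, action)
```

Whether a given architecture actually exposes anything usable as `predict` is, of course, exactly the kind of architecture-specific work the "safety as a service" framing is meant to absorb.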

Of course, there are issues. On a practical level, this approach won't work by itself against AIXI-level intelligences, since they'll just be able to infer the true situation and find some super-indirect way to get what they want. There are all the obvious partial solutions to this problem and all the obvious tradeoffs they come with; I won't get into the details. But the flexibility and generality of this approach lets it serve as a powerful foundation on which to build specific solutions for specific architectures, especially in tandem with interpretability tools.
$\quad$ On a moral level, I'm less optimistic. There is no fundamental moral distinction between artificial and biological intelligence, nor any reason to believe AIs will somehow be incapable of suffering, personhood, or moral injury, and indexical bounding approaches pursued by people who don't sufficiently care about whether what they're doing is right can and will lead to incredible moral catastrophes. (Furthermore, consider that any architecture-agnostic approach to alignment must also work on simulated or uploaded humans as well, since there is no sense in which any individual human is necessarily aligned by default.)
