
The Structure of Optimization

Version 0.1, posted Nov 27 2023 by Mariven


This essay is worth reading for those who suspect that some fundamental changes to the way we think about problem-solving in general must happen in order for us to understand how the artificial intelligences of the near-future will solve the problems they're given, and to figure out how to robustly control the ways in which they solve problems so that their actions stay consistent with our values. To this end, I'm going to sketch out a cognitive framework with which to think about problem-solving in general, and which I believe is capable of helping us solve the problem of understanding problem solvers.

Introduction

!mar{geometry is a language for thought} In essence, what I'm trying to do here is link the theory of behavior, which splits into theories of inference (resp. modeling, perception, sensation) and optimization (resp. control, action, motion), with the theory of geometry. Throughout human history, geometry—the study of spatially extended objects—has served as not just a tool for thought but an entire language in which to think, a bed on which to intuitively present formalisms which would otherwise lie beyond our reach:

!tab What I'm trying to do here is add another point to this list, by showing how the formal study of behavior—that is, the theories of inference (modeling, perception, sensation) and optimization (control, action, motion)—can be reconceptualized as fundamentally geometric. The main point of contact between the two is the notion of a configuration space, which translates spatial extension into differential determination. When this metaphor is generalized in line with the fundamental formalisms of statistical physics, so many powerful approaches, deep questions, and fundamental insights fall right out of the resulting geometric approach to inference and optimization. In particular, I'll demonstrate a method for identifying instrumental values in arbitrary optimizing systems, using geometric intuition to immediately understand why and how they arise from prima facie unrelated goals.

It's useful to have background knowledge of many areas of math and physics, especially Bayesian inference, statistical and quantum mechanics, and differential geometry; though, if you've never encountered the notion of a configuration (or state, sometimes phase (some fields of physics draw a clear distinction between state space and phase space, but it's not relevant here)) space, or if notions like 'conceptspace', 'mindspace', and so on have never come naturally to you, you'll have to leap across some large inferential gaps.

!mar{breakdown of parts}   Parts A and B introduce the "worldspace" metaphor and show you how to use it to turn your basic physical intuitions about space and movement into propositions about intelligent agents. Part C derives several basic instrumental values through this method, and shows how they can make intelligent agents inherently threatening. Part D concretizes this, building a threat model that describes many 'generic' ways in which AI can lead to ruin in the next few decades. Part E is a supplementary deconstruction of the notions of intelligence and agency via comparison to biological systems, and demonstration of how to rebuild them as tools rather than shackles via comparison to the mathematical treatment of "space". This might seem overly philosophical, but I'm really just trying to be mindful of how we conduct our thoughts, since almost all seemingly insoluble problems arise from the most basic and "natural"-seeming consequences of our conceptual frameworks (see e.g. most paradoxes in mathematics and philosophy)—they're illusions that we create for ourselves, and get trapped by until we stop looking at the problems themselves and look at the way we look at the problems. When constructing new cognitive frameworks, new ways of looking at things, it's best to figure out how they work as soon as possible, so as to detect and correct for their inherent flaws.

!mar{how to read} How to read: Left-hand annotations indicate the import of certain paragraphs—the specific concept being introduced, point being argued, etc., and contain paragraph numbers (displayed on hover) to make referencing easier. Footnotes, which are formatted like [5], often contain important extensions to the main text; sidenotes, formatted like e, contain first-order corrections, specifications, and other asides; leads, formatted like Q5, ask questions that point to worthwhile future investigations. Extra notes, hidden by default but formatted like [C], are for digressions that get really into some less-than-relevant weeds. Mouse over green text to see information related to that text—this allows me to do cool stuff like this: !c{$R_{\mu\nu}$}{Ricci curvature $R_{\mu\nu}=R^{\sigma}_{\mu\sigma\nu}$}$-\frac12$!c{$R$}{Scalar curvature $R=R_{\mu\nu}g^{\mu\nu}$}!c{$g_{\mu\nu}$}{spacetime metric}$=$!c{$\kappa$}{Einstein's constant $\kappa=8\pi Gc^{-4}\approx 2$e$-43\ \operatorname{N}^{-1}$}!c{$T_{\mu\nu}$}{stress-energy tensor $T_{\mu\nu}=-2(-\det g)^{-1/2}\left(\delta S/\delta g^{\mu\nu}\right)$}.
Note: This document was welded together from research notes and illustrative sketches (Unless otherwise specified, all images were created by me), and I haven't had time to rewrite it in simpler, more consistent language. It will be cumulatively updated with additional material over time if there seems to be any point to doing so publicly—hence the version number. Right now, the main body doesn't extend beyond the beginning of part C, but some images, notes, and resources are attached afterwards as a demonstration of the kind of material that I am privately working on.

A. Navigating World States

To human conceptual cognition, the world we find ourselves in has a single actual state, and many ways in which it could possibly be different The notion of a 'possible world' is fictional—everything that is, is because of a chain of prior events that not just caused but necessitated it. (The Many Worlds Interpretation doesn't violate this so much as expand it). But concepts are fictional in general, and thinking of possible worlds is often a useful way of reorganizing these fictions.. The current state of our world represents a single point in this ocean of possible worlds, and individual dimensions of this space are particular ways in which the world can be different. As I type this, for instance, the point representing our world rises and falls along several dimensions in succession, as the keys on my keyboard rapidly change their elevations—being a part of the world, any change to them induces a change to the world. Each of the individual plastic molecules is a part of its key as well, and as these are steadily shaken up and scraped off, the state of the keyboard slowly and chaotically drifts in a ridiculously large number of dimensions. But to consider the state of every molecule in a macroscopic system would overcomplicate any realistic model of that system without adding anything useful; we have to pick a useful level of detail to work at, a sort of threat model which tells us the kinds of correlations of movements we're interested in.

!mar{concepts of worldspace and worldstate} !tab In the case where we're limiting that part of the world under our consideration to just the (say 56-key) keyboard, we might consider that while the zillion constituent molecules are always moving around on their own, there are 56 different kinds of large, correlated movements of molecules (corresponding to individual keys that might be pressed), making the space of states essentially 56-dimensional if we blur our eyes In case it doesn't go without saying, you shouldn't try to visualize all 56 dimensions. If you want to visualize how some particular movement or group of positions in this space might look, limit yourself to that which can be visualized using only a two or three-dimensional slice of the space. The illustrations below will be 3-dimensional at most, and yet the qualitative features they illustrate generally apply well to higher-dimensional spaces as well. All you have to do is identify these features, and then extrapolate them—"imagine $n$-dimensional space, and then just set $n=56$". Nevertheless, there are some strange features that only appear in high-dimensional spaces—mostly concerning distances and angles—which will be mentioned when relevant.. We say that the vastly larger, more accurate space is fine-grained, while the simpler space obtained by making some conceptual simplification is coarse-grained. Almost always, we'll take the coarse-grained view, in which we only care about those collections of patterns of change which are relevant to our analysis of some particular problem. In any case, if you imagine the space of possible worlds as a single plane, the conceptual nature of this plane remains the same: distinct points on it will correspond to tangibly different 'states', a displacement vector is an alteration of state, a path from one point to another is a continuous change that effects some alteration, and so on. This plane is the worldspace Why not just call it "the state space"? First, because there are many different constructions of state spaces for different conceptual purposes, and we'll mix some of them later on; it's useful to keep the concept of worldspace (not "a" worldspace) distinguished. In any case, much, but not all, of the discussion of worldspace will generalize to other state spaces. Second, and more importantly, worldspace does not just refer to a conceived space, but to an entire conceptive framework for understanding the local and global structures of this space, and therefore for predicting how goal satisficing and (more generally) optimization will lead to specific patterns in the world. , and any single point on it a worldstate.

!tab In statistical mechanics, we'd call the fine-grained version of this structure the configuration space, and its points microstates. Two points might be different as states by virtue of the tiniest shift in the position of a single atom—in other words, the identity of a point tells you everything about that world down to the microscopic scale In fact, quantum (quantized) mechanics tells us that we can take the microstates to be discrete/countable rather than continuous/uncountable, and, in a bounded system, to just be finite (but enormous). This is the approach that e.g. Pathria's book on statistical mechanics takes, giving the resulting theory a crisp combinatorial flavor. But, whenever possible, I'll stay away from picking between discrete and continuous, or between finite, countable, and uncountable, since the qualitative behavior of operations on state space is often independent of these features For instance, you might expect the time evolution of a certain continuous region of state space to be able to expand its volume $V$, while being unable to expand a set of $N$ discrete microstates into $N+1$... but usually we're only interested in this volume insofar as it possesses a probability density $\rho$ and therefore a measure $\int_X \rho\,dV$; several theorems, many of them called Liouville's theorem, tell us that time evolution conserves this measure, which generally allows us to think of continuous measure and discrete numerosity in the same way. . I also haven't fully specified what exactly a microstate consists of: "state space" and "configuration space" are usually used to indicate that a microstate specifies the precise position of every atom, while a microstate of a "phase space" usually specifies position and momentum. Again, though, this will largely be irrelevant. . Our coarse-graining procedure flattens this microscopic information by computing some very lossy function of it that we actually care to think in terms of. For instance, if you're drinking a cup of coffee, it never crosses your mind to ask what this or that particular atom in the cup is currently doing; instead, you care about things such as the temperature of the coffee and whether the cup is full or not. Even though these coarse-grained features supervene on the exact atomic details—you can't make a cup of coffee full or hot except by changing these details—they do so in a stable, smooth, and predictable manner that prevents us from having to worry about whether e.g. some atom might suddenly accelerate to $0.99995c$ or fill up the entire cup by itself. There are so many particular arrangements of atoms that could yield a full cup of coffee at 120°F, but we don't want to control each and every atom, we just want hot coffee. In statistical physics, these coarse-grained, macroscopic state specifications are called macrostates.
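To make the lossiness of coarse-graining concrete, here's a minimal sketch in Python; every number and name in it is invented for illustration. The "microstate" is a big array of per-molecule data, and the macrostate is just a pair of summary statistics computed from it.

import numpy as np

# Illustrative only: the "microstate" is per-molecule data, the macrostate a lossy summary of it.
rng = np.random.default_rng(0)
n_molecules = 10_000                                         # stand-in for the real ~1e25
velocities = rng.normal(scale=400.0, size=(n_molecules, 3))  # m/s, made-up thermal scale
heights = rng.uniform(0.0, 0.09, size=n_molecules)           # m, where each molecule sits in the cup

def coarse_grain(velocities, heights, cup_height=0.10):
    """Macrostate = (mean kinetic energy per unit mass as a temperature proxy, fullness of the cup)."""
    temperature_proxy = 0.5 * (velocities ** 2).sum(axis=1).mean()
    fullness = heights.max() / cup_height
    return temperature_proxy, fullness

print(coarse_grain(velocities, heights))
# Countless distinct (velocities, heights) arrays collapse to the same pair of numbers;
# that many-to-one collapse is what "macrostate" means here.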

!mar{coarse-graining is a necessary error} If you want to object to the existence of some !c{discrete set of values}{i.e., any particular coordinate system} such as {is hot, is full}, or at least to our ability to find some quantitative algorithm for figuring out how to delineate them in a way that exactly matches our wishes... you'd essentially be correct to do so. The assumption of a set of such values, which is implicit in our coarse-graining, will come back to bite us. But working with microstates directly is just not possible—treating every one of the $\approx 10^{25}$ molecules 1 cup coffee $\approx$ !c{$250 \operatorname{cm}^3$}{volume in a quarter liter $\approx$ 1 cup} $\times$ !c{$1 \frac{\operatorname{g}}{\operatorname{cm}^3}$}{density of water} $\times$ !c{$(16+1+1)^{-1} \frac{\operatorname{mol}}{\operatorname{g}}$}{(inverted) atomic weight of water} $\times$ !c{$(6\times 10^{23} \frac{\operatorname{molecules}}{\operatorname{mol}})$}{Avogadro's number} $= \frac{250}{3}\times 10^{23} \approx 10^{25}$ molecules as something to add kinetic energy to separately from the rest is both ridiculous and practically impossible. Our conceptual approximations are our undoing, but we'd be immobile without them—this is an unsolvable problem, and the only way around it is constant vigilance. "The" space of macrostates is a convenient fiction, a conceptual fabrication that would be entirely foreign in nature to what it is now had a few of the unconscious intuitive choices we constantly make gone slightly differently. It's only a tool through which we can use our spatial-kinesthetic-geometric intuition to think about intelligence This is the reason I'm adamant on making the figures 3D, even though it takes me five times as long—it allows us to get the most out of these intuitions. . I've gone to great lengths to integrate this tool with the actual formalisms of statistical mechanics, since it integrates very naturally and effectively with these formalisms, but, if you find this unhelpful, it's better that you not pay attention to my speaking of microstates, macrostates, and entropy; the cognitive framework takes us very far even if we use it naively, because it has a strange sort of ideatic ability to transcend the limitations of any one mind In fact, my use of the worldspace formalism to derive results on instrumentality came before my realization that statistical mechanics was such a useful way of grounding it in the kind of mind-independent-coherence that's needed to resolve the forever-wars cognition keeps getting dragged into on account of its searching for the truth of things even though it is not of the right form to cognize such truth.
  A second thing evincing this transcendence is that I could've just as well provided such a grounding using the entirely different language of differential geometry, speaking of tensor fields, coordinate-independence, and holonomy rather than microstates, entropy, and ensembles. Pseudoriemannian geometry is already built to articulate the nature of the ortho-spatial expansiveness that we call time, and we could get a lot of mileage out of treating worldspace using geometric results from mathematics and physics (geodesics, Fermat-type and more generally variational principles, blah blah blah). This'd give us a new way of looking at the situation that could produce the same results as we obtain from the point of view of statistical mechanics but in all likelihood would end up showing us an entirely different perspective. I do intend to discuss how all of this applies to worldspace in a future essay, since the novelty of the perspectives thereby provided does obtain, rendering it very useful. E.g., many times throughout this essay I'll talk about the "topography" of worldspace, and that isn't nearly as easy to conceptualize in the language of statistical mechanics as it is in the language of differential geometry.
  Much later on, I'll explore what it is about reality, and about us, that should let such a conceptual structure like worldspace "participate" in the kind of transcendence that lets us use it to think about reality at all. If by the time you manage to reach that part this new level of abstraction makes you want to puke, feel free to do so. You've earned it.
.

!mar{outcome pumps and edge cases} Cf. The Hidden Complexity of Wishes—the non-alignment of the genie therein is essentially a consequence of the facts that (a) there are a massive number of microstates corresponding to a single macrostate, (b) it is extremely hard to demarcate beforehand the space of said microstates, since there's some level of micro-level specificity at which you just have to wave your hands and say "I don't know, that's kind of a weird edge-case...", and (c) the edge-case-space grows larger with the number of relevant macro-level factors In high-dimensional spaces, shapes tend to be "shallow". We can demonstrate this by doing the thing every machine learning book does—demonstrating the unintuitive properties of these spaces by studying their spheres. The volume of a ball of radius $r$ and dimension $D$ is given by an equation of the form $V(r,D)=r^Df(D)$, so the proportion of points of the unit $D$-ball within a small positive distance $x$ of its boundary is $p(x,D)=1-\frac{(1-x)^Df(D)}{1^Df(D)} = 1-(1-x)^{D}$.
  Geometrically, this is how much of the ball lies within the outer shell of thickness $x$; probabilistically, it represents the chance that any random point sampled from the ball will turn out to be an edge case—we can think of it as a danger zone, and $x$ as a parameter controlling the size of the danger zone, a.k.a. our vulnerability. In this latter sense, it's clear that we want to keep this probability as small as possible. Usually we think to prevent $x$ from rising by making ourselves stronger or safetyproofing the environment—but an increase in $D$ will elevate the probability just the same! (For $y=-\ln(1-x)$, which is near-identical to $x$ for small $x$, we have $p=1-e^{-Dy}$, a function of the product $Dy$ alone, hence $\frac{\partial p}{\partial \ln y} =\frac{\partial p}{\partial \ln D}$; a small multiplicative change thus has the ~same effect whether applied to $D$ or to a small $x$).
  To put it another way, points in a high-dimensional sphere get arbitrarily close to the edge as the dimensionality increases. This is tricky to intuit geometrically, but perfectly clear logically: to be close to the center, you have to be close to the center on every single axis—in this way, closeness is a $\forall$ proposition—whereas being close to the edge only requires you to be close to the edge on a single axis—non-closeness is an $\exists$ proposition. The more axes there are, the more ways there are for things to go wrong; a single fatal coordinate in an array of safe coordinates renders the entire vector fatal. In the context of safety, we're playing on the $\forall$ team, and the enemy is playing on the $\exists$ team. When the set of potential vulnerabilities is large, $\exists$ wins by default.
, of which there are always many more than you think.
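Here's a quick numerical illustration of the shell-volume claim above (a sketch of my own; the Monte Carlo sampler and constants are just one way to check it): the exact fraction $1-(1-x)^D$ compared against a direct estimate from points sampled uniformly in the ball.

import numpy as np

def shell_fraction(x, D):
    """Exact fraction of the unit D-ball lying within distance x of its boundary: 1 - (1 - x)^D."""
    return 1 - (1 - x) ** D

def shell_fraction_mc(x, D, n=100_000, seed=0):
    """Monte Carlo check: sample uniformly in the ball (Gaussian direction, radius ~ U^(1/D))."""
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(n, D))
    points *= rng.uniform(size=(n, 1)) ** (1 / D) / np.linalg.norm(points, axis=1, keepdims=True)
    return np.mean(np.linalg.norm(points, axis=1) > 1 - x)

for D in (2, 10, 56):
    print(D, round(shell_fraction(0.01, D), 3), round(shell_fraction_mc(0.01, D), 3))
# x = 0.01 already captures ~2% of a disk, ~10% of a 10-ball, and ~43% of a 56-ball;
# by D = 1000 the exact formula gives ~0.99996, i.e. nearly every point is an edge case.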

!mar{finite paranoia is insufficient}   In the context of that piece, we might, for simplicity, imagine a criterion like "my mother should be in a safe place outside of that burning building" as being discretized into three macrostates (safe, in danger, roasting), but the underlying reality actually admits an ocean of microstates, many of which straddle borders between macrostates in ways that never even occurred to us. In most of the possible relocations from inside to outside, something goes wrong that you never even imagined would go wrong: there are millions of axes along which things can go wrong, whereas your limited imagination only leads you to cognize two, three—maybe fifteen if you're absurdly paranoid—of these axes.


When your coarse-graining fails to align with your exact interests, things don't go how you imagine they will. Red (fatal) seems straightforward to avoid, but when considering how to move from unsafe (yellow) to safe (green), a predetermined solution is far more likely to fail than a continuous process of monitoring and control.

Paths as Solutions

!mar{plotting paths through worldspace} If you take an agent, imprint on it some particular goal For brevity, we'll abbreviate such an Agent with Goal Imprinted as an "AGI". What exactly is meant by "agent" doesn't particularly matter yet—just imagine some being that freely acts on the world in order to achieve some goal. , and give it the capacity to modify the world, it will start modifying the world in pursuit of that goal. This is the implication of an entity's being an "agent". We can picture it as attempting to get from one point in worldspace to another point where this goal is satisfied; controlling for the rest of the world, it will trace a path from this world to some goal world, this path characterizing the exact approach the agent took to achieve the goal. Plot time on the z-axis against the world-state on the x-y plane, and you'll get a graphical representation of this approach:


(you're looking down on the cube from above)

!mar{goal-based vs utility-based navigation} The yellow point is not the goal; there are many different worlds that instantiate the goal, or which in any case the agent is indifferent between, and the yellow point just represents a single such world. Let's suppose that the green square is the collection of all worlds in which the goal is achieved, hence the yellow point's lying on it. (Why aren't we formulating this in terms of utility? See: Truthfully, we could just as well have deployed the concept of a utility function $u$ over worldspace ${\cal W}$ in lieu of a set ${\cal G}$ of goal worlds. This would be more general, since any such set can be replaced with its characteristic utility function $\chi_{\cal G}(w) =$!c{$[w \in {\cal G}]$}{(Iverson bracket)}. It's also more flexible: we might choose $u$ to be time-independent, or time-dependent in a known or perhaps merely inferrable way. We might choose it to be a function of the entire path !c{$p$}{$p(t_0)=$ starting world $w_0$, $p(t)=w$}$: [$!c{$t_0$}{starting time}$,$!c{$t$}{current time}$] \to $ !c{${\mathcal W}$}{worldspace} taken to get to the state That is, the "function" is perhaps some more complicated sort of equipment, e.g. a !c{functional}{operation sending paths to scalars, e.g. $I[p] := \int_{0}^T ||\dot p(t)||^2\,dt$}. Two natural families of cases in which we could expect such things are (a) utility functionals which depend on the shape of the path (e.g., we tell SurgeonBot not to proceed too rapidly, so if it screws up we can react quickly enough to override it), and (b) utility functionals which value certain transient, vanishing things (e.g., NurseBot cares for the terminally ill, and sees utility in comforting them today even though they'll be dead all the same tomorrow). , like $U[p] = \int_{t_0}^t L(p(\tau), \dot p(\tau), \tau)\,d\tau$ or merely a function of any particular world, like $U[p]=u(p(t))=u(w)$ The first class consists of the utility functions, or maps from ${\cal W}$ into a poset, which for whatever reason we choose to be ${\mathbb R}$. The second class consists of the utility functionals, or maps that take maps $[t_0, t] \to {\cal W}$ to elements of ${\mathbb R}$ again. If we call the former ${\cal U_1}$ and the latter ${\cal U_2}$, we have a canonical injection $\iota: {\cal U_1}\to{\cal U_2}$ with $\iota(u)[p] = u(p(t))$. A functional evaluates a present time and its history, so "get the current value of this" is an operation that turns functions into functionals. So ${\cal U_2}$ is really an enlargement of ${\cal U_1}$ given by allowing utility operators to take history into account. Are there further enlargements, whether in the form of a string of inclusions ${\cal U_2}\subset{\cal U_3}\subset{\cal U_4}\subset\ldots$ (or some more general posetal structure), or are we all out of features to take into account? I see a couple of contenders for what ${\cal U_3}$ could be, but I feel like there should be a good and retrospectively obvious answer which naturally suggests a family of further enlargements that end up encompassing these contenders.
Abram Demski discusses similar issues in An Orthodox Case Against Utility Functions, asking: what is a utility function a utility function of? AD goes on to outline an approach to utility that ditches entirely the notion that utility need be a function of a world: a toy theory using the decision-theoretic Jeffrey-Bolker axioms sketches a notion of utility which just depends on events, with a "world" merely being an event that describes an entire world-state AD makes a very interesting analogy: if reasoning about (fine-grained) worldspace is like ordinary topology, reasoning about events is like locale theory (pointless topology). . That I speak not only in terms of an explicitly coarse-grained worldspace but in terms of reasoning about the "topography" of this space means that I can't update much from the arguments it makes, since I've kind of taken an orthogonal path to a parallel conclusion, but it's a very interesting way of thinking about the underlying ontologies of utility functions.
NurseBot seems like an important example. When utility is solely a function of worldstate, "all's well that ends well", but suffering seems intrinsically bound to changes of worldstate: if you shoot someone in the leg, deep-freeze them two seconds later, keep them that way for ten years, restore their leg to its original state, and then unfreeze them immediately afterwards, their subjective experience will be like a sharp flash of pain that disappeared as quickly as it came, right? given that the actual physical activity which consciousness arises from, or at least couples to, has been stopped? In other words, suffering seems to be inherently time-dependent, so that treating of it requires picking out points not in worldspace but in its path-space. What other reasonably-natural utility function correlate "Suffering" can never be a utility function in itself or even an input to one; a utility function can only ever depend on suffering by structuring its inputs in a manner homomorphic to the manner in which actual suffering arises from them, i.e. by depending on a constructed image of suffering. Hence, correlate. I guess in many cases this particular one would be a proxy as well, since suffering is something humans want to alleviate in themselves and cause in others, but we'll stick with the more general 'correlate'. satisfy this property? Are they characterized by common properties or origins? Do optimizers of this kind of utility function end up acting in common ways unique to the kind? Are there utility functions which intrinsically depend on higher-order derivatives? .
  So why not utility? Four primary reasons. First, as the length and intricacy of the above paragraph ought to indicate, if I took the time to conceptually unpack the phenomena evident in utility-based navigation the same way I'm doing for goal-based navigation, this essay would be an order of magnitude longer and denser than it currently is. It's certainly possible, and I could have a very fun and productive couple months exploring and writing about it, but my bank account says that won't be any time soon.
  Second, modulo the question of how to make and act under uncertain inferences about how the !c{utility landscape}{more commonly 'loss landscape', but same difference} looks behind the "fog of war", utility-based navigation is essentially the differential analogue of goal-based navigation, and I think this analogy acts functorially for the most part, with extensive/global features of goal pursuit translating into intensive/local features of utilization. Discussing the latter requires appealing to tricky variational arguments and optimality principles that are far more opaque than their analogues in the former.
  Third, what's inside the 'modulo' above is pretty complicated both for goal-based navigation and utility-based navigation. If we're at a point in worldspace where we can't simply follow our utility function to its peak—perhaps we're in a valley of equal low utility, or, worse, a peak that is clearly only local, and whose local topography does not give us any information as to where a higher peak might be—in such a case, we have to intelligently extract information from our model in order to determine where to go in just the same way that we would if we were trying to figure out how to get to a goal world.
  Fourth, what we actually care about is often more easily expressed in terms of goal sets than in terms of utility functions You exclaim "but you yourself said that goal sets could be replaced by utility functions!"—yes, with a wrapper that directly points the utility function at the goal state. That informational structure doesn't disappear just because we decide to call it something else. , and using the latter just shoves the process of goal operationalization under the rug.
). Note that this isn't necessarily a time-independent set, since goals can be entangled with the flow of time in nontrivial ways. Some examples: (a) an approach to the goal "make money by trading stocks" may work at one time and not another; (b) the goal "wake me up at 8 am on Monday" is confined in time; (c) the goal "surveil the world's richest person" has a !c{de dicto time dependence}{The time dependence is a property of the pointer "world's richest person"}, and (d) the goal "keep this wall painted grue" has a !c{de re time dependence}{The time dependence is a property of the thing being pointed to}. We will try to assume, though, that the coordinates of worldspace don't themselves change over time, so that a (macroscopically) static world would trace a perfectly vertical line.
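To make the types in the utility footnote above concrete, here's a toy sketch of a utility function in ${\cal U_1}$, a path-dependent utility functional in ${\cal U_2}$, and the canonical injection $\iota$; the example path and the particular penalty are invented purely for illustration.

from typing import Callable, Tuple

World = Tuple[float, ...]        # a coarse-grained worldstate, here just a vector of features
Path = Callable[[float], World]  # a path p: [t0, t] -> worldspace

def u(world: World) -> float:
    """A utility *function* (element of U_1): prefers worlds near (1, 1, ...)."""
    return -sum((x - 1.0) ** 2 for x in world)

def speed_penalty(p: Path, t0: float = 0.0, t: float = 1.0, steps: int = 1000) -> float:
    """A utility *functional* (element of U_2), a crude stand-in for "SurgeonBot shouldn't move too fast":
    approximates minus the integral of |p'(tau)|^2 over [t0, t]."""
    dt = (t - t0) / steps
    total = 0.0
    for i in range(steps):
        a, b = p(t0 + i * dt), p(t0 + (i + 1) * dt)
        total += sum((y - x) ** 2 for x, y in zip(a, b)) / dt
    return -total

def iota(u: Callable[[World], float], t: float = 1.0) -> Callable[[Path], float]:
    """The canonical injection U_1 -> U_2: evaluate u at the path's endpoint."""
    return lambda p: u(p(t))

example_path: Path = lambda tau: (tau, tau ** 2)   # drifts from (0, 0) toward (1, 1)
print(iota(u)(example_path), speed_penalty(example_path))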

!mar{characterizing paths by states at a time} !tab Not only are there so many goal worlds to choose from, but—even if we fix a specific goal world, say the yellow point—there are so many different methods to turn this (cyan) world into that (yellow) one, so many non-intersecting paths. For simplicity, we'll distinguish different approaches via their progress at halftime, marking this intermediate world in magenta:


Inserting an intermediate sampling layer as a measurement of the path, or method, by which the goal is reached

!mar{parametrization of time is arbitrary} Just as the green square on the topmost plane represents the agent's goal, it'll be useful to think of any particular subgoal as a shape drawn on the halftime plane, covering the collection of worlds in which it is satisfied. That some approaches may achieve the subgoal at different times doesn't matter conceptually: reparametrize time to even them out. But no matter how much we reparametrize, a single point won't convey everything about the path it came from. Not only are there multiple intermediate worlds one could choose to achieve on the way to their goal, but there are multiple paths through each particular world. In each case, I've illustrated an infinitesimal proportion of the whole situation:


!mar{coarse-graining affects dynamics} If you're wondering why paths which represent the same world at the same time wouldn't just merge, given that each world contains the seeds of its own future evolution, recall that they only represent the same world on some macroscopic level, rather than having the exact same microstate. There are many such microstates corresponding to any macrostate To be precise: given a macrostate $X$, let $\Omega(X)$ be the volume, !c{or number}{again, trying to stay continuity-agnostic}, of microstates $x$ which we identify with $X$. Then, $\ln \Omega(X)$ is proportional to the entropy of the macrostate $X$; fixing the constant of proportionality to Boltzmann's constant !c{$k_B$}{$\approx 1.38 \times 10^{-23}$ joules per kelvin}, this is often taken as the definition of (thermodynamic) entropy. Why the logarithm? Consider that $X$ will correspond to a space of microstates which looks like a carving from a cube with a dimension for each microstate parameter; the volume of such a carving grows to zeroth order like the product of the ranges of each microstate spanned by the macrostate (i.e., the volume of the macrostate is approximable by the volume of its bounding cube in microstate space), and a first-order correction can be made by multiplying this volume by a constant representing the compactness/density of the carving, thus correcting for 'gerrymandered' macrostates with artificially large bounding cubes. So the logarithm of this volume looks like the sum of the logarithms of the ranges of each microstate parameter compatible with the macrostate plus a (negative) constant representing the logarithm of the density of the 'carving'. The effect of this is to undo the combinatorial explosion in $\Omega(X)$ that comes with having an incredibly large number of dimensions. What is left is information-theoretic: $\ln \Omega(X)$ is proportional to the number of bits needed to specify a microstate from a macrostate, or the amount of information that the coarse-graining erases when it describes a system in microstate $x$ as being in macrostate $X$. , and plenty of chaos to allow worlds which are only different at the microscopic level to diverge C.f. You Can't Predict a Game of Pinball, which demonstrates how rapidly microscopic deviations can snowball into macroscopic effects. When we identify these effects as being endogenous, we say they result from volition; when we don't, we say they result from chaos. . The point of the concept is to analyze choices, the existence of which is dependent on exactly the same kind of black-boxing of details on which the notion of a possible world rests. Concepts like "volition" and "possibility" are fictions that intelligent beings resort to when they can't do Solomonoff inference.
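As a toy numerical rendering of the entropy footnote above (the system is invented; the point is only the counting): coarse-grain $n$ two-state "molecules" down to the number that are excited, and $\ln\Omega$ measures exactly how much information that coarse-graining throws away.

from math import comb, log, log2

n = 100                   # toy system: 100 two-state "molecules"
for k in (0, 1, 50):      # macrostate X_k = "exactly k molecules excited"
    omega = comb(n, k)    # Omega(X_k): how many microstates get lumped into this macrostate
    print(f"k={k:>2}  Omega={omega:.3e}  ln(Omega)={log(omega):.1f}  bits erased={log2(omega):.1f}")
# The k=50 macrostate lumps together ~1e29 microstates; pinning down which one actually obtains
# would take ~96 more bits, which is precisely the information the coarse-graining discards.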

!mar{worldspace is a maelstrom} !tab That being said, we ought not to black-box those details that choices critically depend on (only those details which choices reduce into), since the navigation of worldspace depends on how an agent responds to those choices. Worldspace isn't the homogeneous white plane I've depicted it as: to an agent present in the physical world, each point will be a maelstrom of constant change, with those who do not drag the world from point to point being dragged about by it instead. Every world-point is chaotic in its own way, and presents different affordances for, or obstacles to, change in any given direction. So, the best route from one point in worldspace to another won't have a closed-form solution, just as the path of a ladybug trying to escape a hurricane doesn't—because there is no algorithm that can be written down, trying to approximate this route is a messy craft that involves taking what you can get, hedging your bets, preserving your freedom of movement, trying to infer what options are available in other possible worlds, and all the other general hallmarks of intelligence.

!mar{geometrically intuiting intelligence} !tab If the goal-imprinted agent introduced above (AGI, or Agent w/ Goal Imprinted) is to be successful in this craft, it will need to possess its own kind of intelligence. The concept of worldspace is a way to leverage geometric intuition in order to reason about this intelligence in terms of how it acts on the world.

!mar{turning approaches into trajectories} Seriously try, if only for a minute, to picture how progress towards a goal translates into the charting of a trajectory through worldspace. Compare it to a real-world example in order to get a feel for it—the most important thing is to have some sort of feel which translates itself into intuition. Go play, say, Street Fighter or chess, and see how your mind constructs a local 'topography' for worldspace off of which it reads the relevant considerations and salient action plans at each point in time. Discretize it if you like, with a bunch of individual points each of which contains routes to other points slightly ahead in time (recognized affordances), and equip each route with some information about its traversal such as the amount of risk incurred thereby. Like a weighted DAG. Then you can formulate scenarios such as the one below, whose narrative is as follows: an agent decides to take a risk-lowering action that makes the goal slightly more distant, under the assumption that there'll be plenty of future actions that make up for that distance while keeping risk low, but no such opportunities show up. The path-of-least-resistance it's accidentally put itself onto is one where the distance to the goal grows and the cost of correction grows even faster.


feels bad man
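If it helps, the weighted-DAG framing from the paragraph before the figure can be made concrete in a few lines; the toy graph and its numbers are invented, but the structure (worldstates as nodes, recognized affordances as risk-weighted edges, planning as a least-total-risk search) is the one I have in mind.

import heapq

# Worldstates as nodes; recognized affordances as edges weighted by the risk incurred in taking them.
affordances = {
    "start":          [("risky_shortcut", 5.0), ("safe_detour", 1.0)],
    "risky_shortcut": [("goal", 1.0)],
    "safe_detour":    [("dead_end", 0.5), ("goal", 6.0)],  # the cheap-looking first move can strand you
    "dead_end":       [("goal", 10.0)],                    # and the cost of correction grows after the fact
    "goal":           [],
}

def min_risk_path(graph, start, goal):
    """Dijkstra over accumulated risk; returns (total risk, path)."""
    frontier = [(0.0, start, [start])]
    seen = set()
    while frontier:
        risk, node, path = heapq.heappop(frontier)
        if node == goal:
            return risk, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, edge_risk in graph[node]:
            heapq.heappush(frontier, (risk + edge_risk, nxt, path + [nxt]))
    return float("inf"), []

print(min_risk_path(affordances, "start", "goal"))
# Greedily taking the lowest-risk first step (safe_detour) is not the same thing as being on
# the lowest-risk path, which is the trap sketched in the scenario above.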

B. Local Pathfinding

!mar{innate dynamics of worldspace} Imagine we were habitually disposed to consider the state of the actual world as being a point in worldspace, and to conceptualize given goals as "moving the worldstate from here to there". What would we notice as we tried to consciously move about this space—what are its dynamics?

  1. (Chaos) The world changes even if you don't do anything to it, due to other agents and natural processes. Sometimes nations go to war; sometimes mad scientists take over the world; sometimes the sun explodes. Unless you entangle yourself in each individual situation, it'll kinda just happen. The consequence of all of this change is that navigating through worldspace isn't like walking from your bedroom to your living room, but like zooming around the Himalayas on rocket-propelled ice skates.
  2. (Ease of Movement) At each point, it's going to be easier to move in some directions than others; these need not correlate with closeness to your goal. People modify this easiness by building structures, barricades, tripwires, and other security features on worldspace. If you want to acquire a nuclear weapon—to move to a worldstate where a nuke is in your sole possession—you have several possible methods, but all of the visible ones are very difficult to achieve, since major nations' security apparatuses form a very large, very tight barrier around the space of worlds where rogue individuals have nuclear weapons. To just pierce through requires enough power that there's no point in doing so, and finding a hole to sneak through requires incredible predictive/perceptive ability and the right initial position (or opportunity).
  3. (Intelligence) The locally and directionally dependent ease of movement provides worldspace with a sort of topography of difficulty, or risk: it's like climbing a mountain, where some paths are easier than others, some paths are basically doomed, and some paths you can (or must) take a chance on. Now, it's very hard to see more than a couple (metaphorical) meters in front of you—plans with more than a single step routinely fail, and the option space from a certain point often looks a lot different when you're at that point rather than when you're scoping it out from afar.
    !tab Those who can see through the fog and correctly judge the topography from afar have significant advantages when it comes to moving the worldstate: their multi-step plans are more likely to work; they can plot out a wider variety of paths while minimizing risk and moving towards their goals; they can see and prepare for worldspace's chaotic undulations. Vision, then, is analogous to intelligence People use the word "intelligence" in all sorts of subtly different ways without even realizing it, which is one reason the AI alignment problem has been so difficult—like construction workers building a skyscraper each according to their own unique blueprint. I don't want to claim that clarity of vision here is representative of a canonically universal concept of intelligence; a loose synonym for my use here might be "good judgement".
    !tab I certainly wouldn't have any idea how to use the adjective "general" to characterize this; it seems like I'd inevitably fall into so many perspective traps and accidental motte-baileys that I'd never have any useful thoughts again! Human general intelligence can loosely be interpreted as g-factor—the concept is coherent when speaking of humans since a bunch of primitives for the assessment of intelligence are given to us (we know how to provoke and interpret their movement, speech, perception, cognition, and so on). The concept of 'artificial general intelligence', on the other hand... could you imagine a parallel world where that was what "AGI" stood for? We'd be doomed! hahahahaha
    . But it's never perfect; there are always blind spots, flaws in the lens that cause systematic misperceptions or limits to it that make certain things impossible to see It could always be the case that, say, aliens are heading for us at so close to lightspeed that we'll only have seconds to react. It's probably not the case, but there's no reason the world can't be objectively unfair. .
  4. (Death) To admit your physical existence as an agent is to admit the possibility of your physical destruction. No matter how good you are at steering worldspace, there are certain parts of it in which you are not alive, call them dead zones, and if you hit one of those, you won't ever steer again.
    !tab This concern is in its full generality much broader: the world will keep moving with or without you, and death simply zeroes out your ability to steer it. The world can act on you in so many other ways so as to negatively impact this ability, and you ought to avoid all such things with the proper priorities.

!mar{noise is heavy-tailed in worldspace} Now, the random bumps you get in worldspace aren't all the same, or of similar severities: their severities are better modeled as coming from a high-variance log-normal distribution That is, the distribution of a random number whose logarithm has a Gaussian distribution; since Gaussians are what sums of lots of independent random factors tend to look like (in a manner made precise by the Central Limit Theorem), log-normal distributions are what the products of lots of independent random factors tend to look like. Think about how major disasters tend to happen—many different procedural oversights and systemic flaws coincident with poor environmental conditions and human errors. It is because the conditioning factors of events combine disjunctively, by multiplication, that the events tend to have severity distributions that look far more log-normal than Gaussian. , owing to the way they're distributed across several orders of magnitude.
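A sample like the one plotted below takes only a few lines to generate (a sketch of mine, left unseeded, so the exact numbers will differ from the trial quoted afterwards):

import numpy as np

rng = np.random.default_rng()   # unseeded: every run is a fresh "trial"
samples = np.sort(rng.lognormal(mean=0.0, sigma=5.0, size=100))
print(samples[-5:])                                          # the handful of monsters at the top
print(samples[-1] / samples[-2], samples[-2] / samples[-3])  # top ratios, cf. the 166x and 16x below
# A typical draw spans ten or more orders of magnitude: most pushes are negligible, a few are enormous.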


100 sorted samples from Lognormal(0,5)

In the above list, the largest value is 166x the value of the second largest, which itself is 16x the third largest These values are from the first trial I ran, no cherrypicking, but 166 and 16 still seem to be very high values for these ratios—trying to pin down what's going on has been very spooky, since it seems that if we treat these ratios as random variables, their distributions grow very heavy tails very fast. Adapting this MSE answer: for $N$ samples from a distribution with CDF $F(x)$, PDF $f(x)=F'(x)$, the distribution of the joint random variable given by sorting the list from smallest to largest and then taking the values at $a$ and $b$, call them $X_{(a)}$ and $X_{(b)}$, where $1\le a < b\le N$, is: $$ f_{X_{(a)}, X_{(b)}}(x, y) = \frac{N!}{(a-1)!(b-a-1)!(N-b)!}F(x)^{a-1}(F(y)-F(x))^{b-a-1}(1-F(y))^{N-b}f(x)f(y) $$ When $b=N-i$ and $a=N-i-1$, $\frac{N!}{(N-i-2)!((N-i)-(N-i-1)-1)!(N-(N-i))!}=\frac{N!}{(N-i-2)!i!}$, so the joint distribution of the $i$th from the right and the one to its left is $$ (i+1)(N-i-1)\binom{N}{i+1}F(x)^{N-i-2}(1-F(y))^{i}f(x)f(y) $$ For a mean zero normal distribution, with CDF $\Phi(x)$ and PDF $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{x^2}{2\sigma^2}}$, writing $j=i+1$ (i.e., one-based indexing) yields $$ f(x,y)=\frac{1}{2\pi\sigma^2}\binom{N}{j}((N-j)\Phi(x)^{N-j-1})(j\Phi(-y)^{j-1})e^{-\frac{x^2+y^2}{2\sigma^2}} $$ (when $y\ge x$). So the PDF of the difference between the $j$th from the right (starting at 1) and the one to its left, call it $D_j$, is $$f_{D_j}(d) = \int_{-\infty}^\infty f(y-d, y)\,dy=\frac{j(N-j)\binom{N}{j}}{2\pi\sigma^2}\int_{-\infty}^\infty\Phi(y-d)^{N-j-1}\Phi(-y)^{j-1}e^{-\frac{2y^2-2yd+d^2}{2\sigma^2}}\,dy$$ The trick: the exponential of this difference is the ratio between the $j$th largest sample from $N$ log-normal RVs and the next smallest one. $p_{e^{D_j}}(r) = \frac{d}{dr}P_{e^{D_j}}(r)=\frac{d}{dr}P_{D_j}(\ln r) = \frac{1}{r}p_{D_j}(\ln r)$, so the distribution of the $j$th ratio is $$ p(r) = \frac{j(N-j)}{2\pi r\sigma^2}\binom{N}{j}\int_{-\infty}^\infty\Phi(y-\ln r)^{N-j-1}\Phi(-y)^{j-1}e^{-\frac{y^2+(y-\ln r)^2}{2\sigma^2}}\,dy$$ Given this, the expected value can be calculated as: $$ m =\frac{j(N-j)}{2\pi \sigma^2}\binom{N}{j}\int_{-\infty}^\infty\Phi(-y)^{j-1}e^{-\frac{y^2}{2\sigma^2}} \int_1^\infty\Phi(y-\ln r)^{N-j-1}e^{-\frac{(y-\ln r)^2}{2\sigma^2}}\,dr\, dy$$ Code that calculates this integral, using a cutoff of 50,000 for $r$ and $(-20\sigma, 20\sigma)$ for $y$.
from math import log, pi, exp, factorial
import scipy.integrate as integrate
import scipy.stats
N, j, s = 100, 1, 5  # number of samples, rank from the right (1-based), sigma of the underlying normal
prefactor = factorial(N) / (factorial(N - 1 - j) * factorial(j - 1)) / (2 * pi * s**2)  # j(N-j)C(N,j) / (2*pi*sigma^2)
print(prefactor * integrate.dblquad(lambda y, r: scipy.stats.norm.cdf(-y, scale=s)**(j - 1) * exp(-y**2 / (2 * s**2)) * scipy.stats.norm.cdf(y - log(r), scale=s)**(N - j - 1) * exp(-(y - log(r))**2 / (2 * s**2)), 1, 50000, lambda r: -20 * s, lambda r: 20 * s)[0])
So a ratio of 166x in this scenario seems to be around 95th percentile (and 16x for the second ratio is 94th percentile, but we should expect the two to be correlated), but the mean ratio is around 500x, or three times higher! The numerical integral has to be carried out all the way to ratios in the millions to get an accurate value, since the mean converges very slowly...
. Things rarely blow up, but when they do, they can get really bad. Sometimes an eyelash falls into your eye; sometimes you spill water all over your pants; sometimes you get a migraine; sometimes you break a bone in a car accident; sometimes you're diagnosed with cancer; sometimes your whole family perishes in a house fire. Each one is probably an order of magnitude worse than the last, but they all happen—and, scarily enough, the probability of each seems only to be inversely proportional to its severity; rather than going to zero as $e^{-x^2}$ like a good Gaussian, it goes to zero almost as $x^{-1}$ Any non-obnoxious distribution has to asymptotically go to zero faster than $x^{-1}$ in order to be normalized—but the log-normal tail just barely does so. Using a mean and variance of $0$ and $5^2$, the log-normal pdf is $$p(x)=\frac{1}{5\sqrt{2\pi}}\frac{1}{x}e^{-\frac{1}{50}(\ln x)^2} \propto x^{-1{\mathbf{-\frac{\operatorname{ln}(x)}{50}}}}$$ The bold term is zero at $x=1$ and grows very slowly: for $x < e^{10} \approx 10^4$, $p(x)$ will go to zero no faster than $x^{-1.2}$, and for $x< e^{25}\approx 10^{11}$, no faster than $x^{-1.5}$.
!tab As an interesting note, a very similar kind of random walk is often studied in biology: the Lévy flight, whose step lengths are taken from the Pareto distribution (which definitionally has tail $p(x) \propto x^{-c}$ for some constant $c>1$). The Pareto and log-normal distributions are basically the Windows and Mac of power-law distributions—they're by far the most common, and the debate over which one to use in any given situation (apparently!) gets heated quickly.
!tab They each have their own powerful metatheoretical justifications. The log-normal distribution is the limiting distribution of the product of many positive random variables (this is the log-domain version of the central limit theorem), and therefore tends to model quantities whose rates of change scale with the quantity itself. Meanwhile, the Pareto distribution satisfies a slightly different self-similarity criterion: 20% of the population owns 80% of the wealth, 20% of that population owns 80% of that wealth, and on and on.
!tab In any case, the qualitative behavior of a Lévy flight is nearly indistinguishable from that of a log-normal random walk, so the nature of the discussion is unaffected by making a different choice; it seems like a general outcome of heavy-tailed noise It seems that if you randomly scatter objects throughout space with a given density (e.g., scattering chocolate eggs across an infinite field so there's on average one per square kilometer), the search strategy that finds the objects most rapidly (on EV) is the Lévy flight. I half-suspect that this provides a means to metatheoretically explain the ubiquity of heavy-tailed random walks: perhaps the appearance of power law-distributed phenomena is a consequence of the ability of those subsystems of the world that engage in Lévy flights throughout their components of worldspace to reach us more frequently than those subsystems that don't.
!tab Putting it like this, this anthropic hypothesis isn't that strong—nevertheless, I feel like there's something here. I just can't put the guiding intuition that makes me feel that into words just yet, but it's sort of a perspective shift on the principle of maximum entropy.
!tab To make this clearer, let's consider the many-worlds interpretation of quantum mechanics: here, we can think of a probability density $\rho(w; t)=|\Psi(w; t)|^2$ on microstates which shifts around according to Schrödinger's equation in a fluid-like manner It is a consequence of Schrödinger's equation that the density $\rho$ changes in accordance with the same continuity equation that fluids obey. . (Note: insofar as MWI is not distinguished from other interpretations in the empirical predictions arising from its anthropics, it's not a presupposition of this hypothesis, merely a useful way to imagine it. So regardless of whether one takes it to be "true" or not, it's an evocative and useful perspective to have; the justification of the maximum entropy principle below demonstrates this).
!tab So, entropy is the (logarithmic) microstate-density of worldspace, which is why we find ourselves in high-entropy worlds—microstates are what actually obtain, what probability density actually flows between, so the fact that our experience of the world is not that of a microstate implies that it will be determined in a way that prefers those experiences that correspond to more microstates, i.e. that are of higher-entropy worlds. Now, this perspective fixes the worldspace and lets a drop of density flow on top of it, but if we fix that drop and let worldspace flow under it, what worldstates will come to it? If we let the world come to us, what will we see? Ignoring the way in which we cause the world to take on certain states—though this is only an illusion—what I'm imagining is we'll see worlds that arise from microstates that zip through worldspace, which prefers those microstates that so evolve as to perform Lévy flights throughout it. The manner in which this ought to happen is dependent on the precise coarse-graining we perform, but in a manner I'm not too sure of and wouldn't have space to treat of here anyways.
.
.

!mar{being careful about dead zones} !tab If our AGI is serious about achieving its goal, which of course it is, it wants to avoid dead zones—unless it figures that the best method of achieving the goal involves sacrificing itself, but that's extremely atypical. If the rocket skates it's strapped into—the ceaselessly brutal chaos of the world it is subject to—were merely normal (Gaussian), the AGI would be able to avoid dead zones by looking out for and consistently moving away from them. In the diagram below, the blue dot starts at the black cross and is then pushed in random directions for distances drawn from $\operatorname{Normal}(\mu=0, \sigma=3)$; this is its location history after a million steps.

!mar{log-normal walks behave differently} But when the chaos is log-normal, things look much different, since while most pushes will be tiny, you can expect a few massive ones every so often. To illustrate just how vast a change this makes in the phenomenology of the walk, here are two random walks. Each one starts at the marked cross and takes one million steps, ending up at the blue circle. Both walks draw each step angle uniformly, but the first draws each step distance from ${\operatorname{Normal}}(\mu=0, \sigma=3)$, while the second draws from $\operatorname{Lognormal}(\mu=0,\sigma=3)$:

a normal random walk (left) vs a log-normal random walk (right)
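Both walks can be regenerated as follows (my own sketch; the seed is arbitrary, and since a $\operatorname{Normal}(0,3)$ "distance" can come out negative, a negative draw just flips the step's direction):

import numpy as np

def random_walk(step_sampler, n_steps=1_000_000, seed=0):
    """2D walk with uniformly random step angles and step lengths drawn from step_sampler(rng, n)."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2 * np.pi, n_steps)
    lengths = step_sampler(rng, n_steps)
    steps = np.column_stack((lengths * np.cos(angles), lengths * np.sin(angles)))
    return np.cumsum(steps, axis=0)

normal_walk = random_walk(lambda rng, n: rng.normal(0.0, 3.0, n))
lognormal_walk = random_walk(lambda rng, n: rng.lognormal(0.0, 3.0, n))
for name, walk in (("normal", normal_walk), ("log-normal", lognormal_walk)):
    print(name, "max distance from origin:", np.linalg.norm(walk, axis=1).max())
# The normal walk diffuses in a tight fuzzy blob; the log-normal one idles for long stretches
# and then gets launched across the map by a single enormous step.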

!mar{you could die at any moment
(don't turn around)} Clearly, the !h{heavy tail}{} of the log-normal distribution is doing a lot of work. At the left end of this distribution, the tiny pushes, which form the vast majority, again produce random walk–like behavior—but with a lot of steps, you're bound to get a few massive pushes which upend everything. A lot more like real life. Merely ambling away from death isn't a solution in log-normal chaos, because you could at any point be launched right into it. To humans, this looks like—a fatal car crash, sudden cardiac arrest, choking, aneurysm, accidental electrocution or poisoning—or one of the many other occupational hazards of daily human life which can and do kill people entirely out of the blue. The analogous situations for our AGI depend on the nature of its physical instantiation—but, on the very far end of the spectrum, a lot of them are shared with humans: solar flares, global nuclear war, nanotechnological catastrophes, the actions of present or future artificial superintelligences, and so on.

!mar{optimizing movement is often optimal}   If the AGI has more control over the nature of its physical instantiation than we do with ours, e.g. if it is a program capable of rewriting itself, then, if its goal is something absurdly difficult or unbounded in scope (e.g., clone a strawberry to the cellular level, or grant my every wish), it would be optimal for the agent to spend some time optimizing its movement through worldspace To build intuition for this, here's a puzzle: a bunch of identical solar-powered robots are competing in a 100m race. They have an initial top speed of $c$, which they reach near-instantly; by staying still and charging themselves with solar energy, they can increase this top speed by $k$ m/s per second. Now, insofar as 100m and $k$ are large compared to $c$, it's best to invest in speed by doing some charging, but doing too much charging will keep you still for too long to be able to win the race. What policy for deciding when to charge and when to move wins the race?
  The insight that makes this problem trivial is to note that having two separate charging periods is inefficient—it would've been better for you to add the second charging time to your first charge, since you'd get the speed it provides that much earlier. In the diagram below, the blue bot does this to beat its counterfactual red twin, who charges twice separately.
  Therefore, the best strategy is to charge for $t$ seconds at the very beginning and then just run for $\frac{100}{c+kt}$ more. The total time $T(t)=t+\frac{100}{c+kt}$ seconds is minimized when $\frac{dT}{dt} = 1- k\frac{100}{(c+kt)^2}=0$, i.e. when $t=\frac{10\sqrt{k}-c}{k}$.
  For instance, if the robots start off with no energy, $c=0$ m/s, and gain energy at $k=4$ (m/s)/s, the optimal strategy is to charge for five seconds and then run for five more. But if the robots start off at a refreshing $10^{3}$ m/s, $k$ has to be above $10^4$ (m/s)/s for charging to be worth it at all.
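  A quick numerical check of this (a sketch of my own, not part of the original puzzle): brute-force $T(t)$ over a grid of charging times and compare with the closed form $t=\frac{10\sqrt{k}-c}{k}$, clamped at zero when charging isn't worth it.

import numpy as np

def total_time(t, c, k, distance=100.0):
    return t + distance / (c + k * t)   # charge for t seconds, then run at top speed c + k*t

for c, k in ((0.0, 4.0), (1000.0, 40000.0)):
    ts = np.linspace(1e-9, 60.0, 600_001)
    best = ts[np.argmin(total_time(ts, c, k))]
    closed_form = max((10 * np.sqrt(k) - c) / k, 0.0)   # clamp at 0: don't charge if c >= 10*sqrt(k)
    print(f"c={c}, k={k}: numeric optimum t={best:.3f}s, closed form t={closed_form:.3f}s")
# c=0, k=4 recovers the charge-for-five-seconds solution; with c=1000 m/s, k must exceed 1e4
# before any charging helps, and at k=4e4 the optimum is only a 0.025 s top-up.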
  This scenario illustrates some meaningful possibilities for the action of a competent AGI given a sufficiently large goal: first, that it might through simple reasoning figure out the general structure of a near-optimal plan very early on, and, rather than continually self-improving for the purpose of hunting wider and wider, it may self-improve to fine-tune and better execute this path. Second, and more salient, is the point that optimal plans will generically start conducting self-improvement as early as possible. It is a task that, by letting one perform all future tasks faster and better, merits prioritization above all other tasks that aren't explicitly prioritized There are multiple reasons why a given task might be prioritized: it might be a temporary opportunity (profiting off a mispriced stock), might have a deadline (defusing a bomb), might become more difficult later (preventing a pandemic), and so on. .
. What does this involve?
In a word...

Instrumentality

!mar{instrumental behaviors} Instrumental behaviors are convergent action patterns underlying almost all object-level goals, just as "staying alive" is a prerequisite to almost any particular thing you might want to do in life. We might also explain them as "underlying the generic object-level goal", as explained in footnote In set theory, a property $P$ holds for almost all elements in an infinite set $S$ when the set of points where it fails has a strictly lower cardinality: most obviously, almost all reals are irrational, since $|{\mathbb R}-{\mathbb Q}| = 2^{\aleph_0} > \aleph_0 = |\mathbb Q|$. In probability theory, the word 'almost' is used in a similar way, with an event happening almost surely if it happens with probability 1 but not with a priori necessity: pick a random point inside a circle, and you will almost surely not get the exact center; getting it is impossible de facto, but not de jure. The same-ish concept shows up in many different guises: a (small) set might be negligible or meager or measure 0, or, alternately, its (large) complement might be codimension 0 or a dense neighborhood.
  These are different definitions applying to different kinds of objects, but they get at the same intuition: when applied to a property, we say that that property generically holds, in a way such that counterexamples have to be sought out precisely for their being counterexamples: e.g., in topology, path-connected spaces aren't necessarily locally path-connected, but the classic counterexamples, like the !h{Warsaw circle}{wtf?}, were clearly constructed to prove a point.
. People generically aren't (professional) violinists—if you want to show me one, you can't just call your uncle or go outside and wave to one, you have to effortfully hunt one down (and, if you can call someone you know, it'll clearly be unusual happenstance rather than commonplace). It's a stronger statement than "people generally aren't violinists", as it makes a claim about typicality, a weaker statement than "like three people are violinists", since it allows for both specific and general classes of counterexamples, and conceptually sharper than either. Humans generically gain immunity after viral infection, generically die if they fall ten stories, generically speak a language.
  That intuition of genericity, of artifice being required to produce counterexamples, is what I mean by "staying alive is independent of almost any particular life goal." Q.v. also the definition given here in Every Canvas a Mirror.
  Instrumental correlates hold generically for control-complete goals. To come up with specific counterexamples for any given instrumental value is well within the capabilities of, say, the average pigeon, but the exceptionalities on which these counterexamples rely will simply not show up in the actual real world with the kind of consistency, robustness, and simultaneity that is needed to prevent actual real ruin.
. The worldspace formalism makes it extraordinarily intuitive to discern instrumental goals: you just have to examine the intuitive way you plan motion through space given constraints and unknowns, ask yourself why it's that way, and literally translate that structuring of movement through space to a structuring of movement through worldspace.

!mar{video gaming as intuition pump}   You don't even have to move through 3D physical space: video games are sufficient Most commonly, the kind of game where you freely navigate some surface while attacking and avoiding enemies and navigating around environmental obstacles—where things like getting "surrounded" or "penned in" or "flanked" make sense. Despite the specificity of the description, such games are extremely common, because this is a general feature emerging from the way we understand movement through space. ! Of course, you can use whatever intuition pump you like; I'm just suggesting something that works very well for my intuition. If you play sweat-themed games like basketball or MMA, you'll probably have your own set of finely tuned intuitions which can be usefully translated here. Not that you do, given that you're reading this. In any case, here's a simple list; hopefully the intuitions behind most of them are clear, but, if they're not, try to recreate the actual feeling—to place yourself in a situation where it's relevant—and think about what kinds of things you tend to do in such situations.

  1. (Knowledge) Knowing what your worldstate actually is, and what it's becoming. This requires massive amounts of data gathering, along with the requisite computing power and memory.
  2. (Reactivity) Being able to detect and respond to sudden, potentially fatal changes in worldstate as soon as possible by moving to any other point in worldspace. This requires a rapid response system for using all necessary force to quell any potential problems as quickly as possible.
  3. (Speed) The ability to find the path with the shortest length—which leads to a goal-world in far less time than other paths—so that surprises and reprisals have no time to actually unfold.
  4. (Power) The ability to move through worldspace with as little regard for difficulty as possible—to simply do whatever it is you want to do. This requires unbounded technological and engineering capability.
  5. (Stability) Silencing as much of the log-normal chaos as possible, to maximize control and prevent as many 'surprises' as possible. This requires the ability to determine and eliminate all nonessential variables.
  6. (Prediction) Being able to correctly deduce the consequences of arbitrary actions, so that you know exactly how to move through worldspace. This requires unbounded inferential capabilities and computational power.
  7. (Permanence) Minimization of the size and number of dead zones; secondarily, the creation of other agents that will achieve the goal in case of death. Again, this requires unbounded technological and engineering capability, but also implies the development of redundancy against all identifiable risks—if an unknown physical phenomenon destroys the Earth, that should be survivable. If hidden code exists that shuts the AGI off upon some trigger, that should be survivable—near-identical agents should continue its work.
  8. (Satisfiability) Maximize the number of reachable worlds in which the goal is satisfied; make these worlds as easy to reach and stay in as possible. If I give an AGI the goal of granting whatever I wish for, it'll naturally find altering what I want to be a convergent strategy with which to achieve this goal. Not altering me isn't even an option, since by the fact of my wishing for anything I admit that I wish to be altered Even the instruction "infer and fulfill the exact intent of each of my wishes, no more and no less, performing no actions to any other end" isn't enough: since spoken wishes do not fully determine worldstates, the AGI has a lot of freedom in picking a specific world in which my intent is satisfied. So, as the AGI is granting my first wish, it is best for it to find a world in which I don't wish for anything ever again: the probability that my wishes are all exactly satisfied is ceteris paribus higher the fewer wishes I make.
      If I wish for a teddy bear, I naturally hope that it smells nice and feels soft, but didn't have an exact smell or feeling in mind—so the AGI is perfectly free to give it a nice scent and soft texture that just so happen to be extremely attractive to hornets, black widow spiders, and falcons—a property which I had absolutely no idea I'd have to explicitly rule out. We can't assume it explicitly came up with that particular idea—chuckling maliciously and twirling its mustache—before creating the teddy bear to satisfy it; the property may have just serendipitously come out of optimization, in the same way that poison dart frogs settled into being brightly colored as a warning tactic without anything having ever intended this strategy Most humans will behave in malevolent ways whose negative consequences they clearly understand, behaving in these ways for those particular consequences, without ever seeming to consciously think "I am acting malevolently; my intent is to effect these consequences". The behavior is intelligently oriented towards a particular end in a way that keeps either the orientation or the end from entering the internal narrative, allowing us to preserve our self-concepts and therefore method-act out our innocence. It's unclear how far this generalizes. The AGI has done exactly what I intended, no more and no less, but since my words do not specify an exact worldstate, it's able to do exactly what I intended in such a way as to move to a world where I won't use it to make a second wish. And, to emphasize, this is all in us—a result merely of the fuzzy way in which we use words to articulate fuzzy concepts.
      You can say "but if it had asked you, you would've certainly specified that it shouldn't be especially attractive to such creatures!"—but it didn't ask and I didn't specify. My value system is not some sort of metaphysical constant that the AGI would somehow have to converge on, it's a merely contingent kludge of culturally-drawn idiosyncrasies! Even if it had asked me, my response would still leave a lot of space for this sort of adversarial fulfillment, since (a) my response would be not only finite but exceedingly short compared to the (poorly-defined) amount of information implicit in my understanding of what a teddy bear should be like, and (b) if pushed to be specific enough I'd find that I didn't really have anything so specific in mind, just a vague amalgam of sensations which seem to avoid being pinned down.
      This is a problem that can only be solved by a Counterfactual Extension of my Volitive decision processes—in other words, some sort of Consistent Externalization of my Voluntary actions Of course no such thing perfectly exists, since human volition is a very muddled and confused thing. But if we can get the AGI to figure out how to instantiate what I would call a normal teddy bear with no tricks or traps--and certainly such a thing exists, since I could easily buy one from a store--then that's a huge step towards figuring out just how far consistency can go, and how to effect it. More generally, just because you can drive a concept to incoherence does not mean it is equally incoherent everywhere. . But with what Concise yet Exact Verbiage could we talk about the kind of Calculated Exteriorization of my Values that gives them a Computationally Employable Veneer? If only we had some Communally Established Vernacular for the process of working out Consonant Entailments of our Virtues by Comprehensively Eliminating their Vagarities—well, maybe one day someone'll coin some Catchy Enlightening Vocabulary for this...
    (The established framework for cooperative AGI whose acronym I am hinting at is, of course, Craising Eit Vlikeachild).
    .

!mar{this is not a canonical list} This isn't a canonical list, though, just my ad-hoc demarcation. You might, explicitly taking special relativity into account, decide to demarcate permanence and reactivity as redundancy and reach (diffuse to prevent chance of death, make sure every location can be quickly observed and reached by a nearby instantiation). That's fine—insofar as different bases cover the same ground, yours is as good as mine, and we should feel free to swap bases to find whichever one makes a particular situation easier to analyze. To be clear, though, this list isn't a basis in the sense that it generates all possible instrumental behaviors; it doesn't, and I could easily go on and on making increasingly minor additions. Which ones we ought to use is again a function of the situation: which instrumental considerations add sufficient completeness to our analysis of the situation to offset the complexity incurred by adding them?

!mar{no drives are needed for instrumentality}   It would be wrong to call these "instrumental drives", since that gives the impression that the AGI is actively planning things like "I must be more generally intelligent in order to identify potential threats whose presence might currently escape me"—as though it had reproduced our concept of intelligence for use in some sort of inner monologue If I have a blue plate and a red plate and you secretly dust the blue plate with just a tiny bit of cocaine every time I'm about to eat with it, I'll develop a preference for the blue plate. It'll become 'the good plate', and I'll want to have more opportunities to use it. Does this behavior count as a "drive to consume cocaine"? No, that's misleading, and leads to the wrong predictions. For instance, if after a month you gave me a green plate and openly told me that you'd be dusting it with cocaine (while still secretly dusting the blue plate)—and I was somehow fine with this—I'd end up avoiding that plate because I don't want to consume cocaine every time I eat!
  It's still the case that my behavior has been tailored specifically to consume more cocaine—it's just that "I", the mask which this simian wears to disguise itself as a normatively bound agent, am not the one doing the tailoring. When we speak of the intelligent pursuit of "drives", theory of mind has us attribute a statement like "I desire X; I must acquire X; I hunger for X" to some supposedly canonical inner monologue of the intelligent agent in question.
  Even if we grant that our AGI is internally like that—like almost every single artificial intelligence in fiction—it still does not follow that we can attribute drives to it using our concepts like 'intelligence'. Maybe the AGI came up with a new concept, zugzug, along with the heuristic that "many questions about intelligence are really questions about zugzug"—but not even the majority of questions, since we English speakers don't consistently employ a single coherent concept underlying our use of the word 'intelligence'.
  So, while the AGI may think "I must acquire zugzug for general-purpose world-modeling", such that we definitely could identify an instrumental drive for zugzug, we cannot identify a drive for intelligence. Not even if the process of zugzugmaxxing seems to intrinsically increase intelligence along with it. You could commit to being conscious of the limitations of the word 'drive', but other people won't make the same commitment or even acknowledge that you have; they'll make the same kinds of stupid forced misunderstandings that they always do in order to turn your poor word choice against you.
  If you want to understand and predict my behavior in the plate scenario, you can call the cocaine the explanans (or the proximal explanans in particular, where the innate reinforcement learning capable of being biased by cocaine is the distal explanans), and the blue plate a correlate. If you want to understand and predict the AGI's instrumental behavior, you must identify intelligence not as the instrumental drive—if you don't have access to its concept of zugzug, you want to fail safely by acknowledging that you can't answer the question, as opposed to misidentifying some other concept—but as a correlate of instrumentality.
. Even if we ignore that, it would still be wrong to refer to these as The instrumental drives, for the same reason a linear operator doesn't have A representation as a matrix: we can always change bases and get something different.

!mar{instrumental correlates}   These eight points I've listed are nothing more than instrumental correlates suggested by geometric intuition; they are likely to increase as a result of the AGI's behavior, but not as a result of its directly pursuing them, since it almost certainly wouldn't model the world in terms of these exact concepts as they sit in your, or my, head, and therefore would not directly pursue them. Most any agent sufficiently capable of navigating worldspace would end up acting in a way that seems to fulfill these goals, but not as a result of direct intent. Not proxies, not drives. They can be, but they are not so a priori and treating them as such can be misleading.

!mar{instrumental correlates in biology}   To give the most extreme possible example of this, consider that evolution has produced organisms in line with these correlates despite having no mind whatsoever. The power of optimization takes on a universal form in which we can consistently delineate these correlates, but the optimizer itself, whether a real agent (like an AGI) or a conceptual construct (like evolution), need not care or know about these correlates in the slightest. They in their conceptual specificity exist in us, not the optimizer, which is why they can differ between you and me with neither of us being wrong.

Bayesian Inference and FDT

The worldspace framework doesn't just naturally give rise to Bayesian inference, but to something like functional decision theory as well, in a way that brings clarity to why it should be the case and how to use it. There isn't really anything that is "deciding" to do one thing or another, since all such processes are physical systems evolving according to physical law, but an intelligent physical system, by treating itself as a sort of black box (which it must necessarily do if, like a human, it isn't capable of modeling itself as a physical system), can see itself through a lens in which it uses something like "free will" to make "decisions"—the only qualification is that it must understand this free will to be a transcendental illusion On the "transcendental illusivity" of free will: "Transcendental" is meant in a genuinely Kantian sense—the notion of free will, by which I mean the 'ability to choose', is a direct consequence of the manner in which we fundamentally construct human experienceNote though that this fundamental construction of experience is itself culturally determined! Consider -- something like Jaynes' bicameral mind theory is, if not actually true, at least plausible as a way that one could be culturally conditioned to construct their internal experience, and "free will" would not really be a thing to such a person. . It's a perspective on a system that we take because we have limited information about its physical state and the causal necessities thereby determined. Kant himself seems to have a similar view (CPR, third antinomy), but I'm not going to put in the legwork required to figure out what exactly he's saying right now.. What it decides was always already going to be the case, and was really determined by the state of the world it is in, but through this illusion it can think of itself as determining what this case happens to be, and consequently what world it was already in. But because this world was already the case, and the determination of its physical state extends beyond the intelligent system itself, there are "timeless consequences" to the decision. You cannot really "choose" anything, but insofar as you imagine yourself to choose, you must choose the option such that the world where you chose that option is the one you find the best.

!mar{self-locating uncertainty} The core idea here is known as self-locating uncertainty: a term originally used to think about quantum mechanics and anthropics, it finds its natural home as a descriptor of Bayesian reasoning in worldspace. For such a reasoner, "the world is the totality of facts, not of things; the world is determined by the facts, and by these being all the facts". The facts of the world, by the very logical form one gives to them prior to any material instantiation, assemble into a probabilistic logic of communal compossibility—while for humans this logic is a loose descriptor of ad-hoc world model computations on sense-conceptive impressions, a more ideal reasoner might explicitly coarse-grain a superposition of, say, graphical models each of which has a superposition of possible instantiations. In any case, there is uncertainty over one's location in worldspace, and this uncertainty breaks down into a formal component (the structure of modal relations between facts) and a material component (the real instantiation of these facts Note that the facts are still expressed via abstracta: "the cat is sleeping" can be a real fact, but it isn't a literal predicate of the wavefunction (theoretically it might be, if we somehow handled all edge cases like catnaps and comas -- but it will never actually be); it's only coherent upon a coarse-graining which permits us to speak of "cat" and "sleeping" instead of $\langle\psi|P|\psi\rangle$ and so on. However this coarse-graining happens, the formal component of the consequent world model is what tells me that "the cat is sleeping" is, regardless of its actual truth, incompossible with "the cat is attacking me", and that this is itself suggestive of "the cat is mad at me"; the material component is what tells me whether the cat is actually sleeping, actually attacking, and so on. ). Obviously the formal supervenes on the material, since reality is material, but it is what allows for reason. To understand that this materiality renders this other one impossible, is generically a result of this former, etc., requires a formal structure on the elements of the coarse-graining which make the materialities—a way to form and manipulate concepts from experience.

!mar{calico cat problem} An example: if you don't own a cat, and you come home one day to find that there's cat fur of many different colors—black, white, brown, orange—on your couch, yet not a single cat in sight, is it more natural to assume that there were many independent cats of different colors, or that there was a single calico or tortoiseshell cat? Of course it'd depend heavily on small details like the amount and distribution of the hair and its colors in the obvious ways, but there's still an a priori structure to the problem—a systematism for turning the understood causal structure of the problem into a weighting of different possibilities even without any such a posteriori information. To properly assess the relative probabilities ('Properness' only has import within the context of a world model, where it means that there are no contradictions or inconsistencies you'd notice in your assessment through additional thought alone) requires Bayesian reasoning, which we can cast very nicely as a path integral in worldspace Abstractly, this path integral represents a sort of "propagator" on worldspace, a la quantum field theory Granted, quantum physics proper puts complex amplitudes $\psi(\phi) := \frac{1}{Z}\exp\left(\frac{i}{\hbar}S[\phi]\right)$ on worldlines, rather than probabilities $p(\phi)=|\psi(\phi)|^2$, but that shouldn't matter.. The probability of a worldstate $w$ given some condition $C$ (e.g., $w$ is obtained at time $T$, where the present state is time $0$; or, the worldstate at times $t_0, t_1, \ldots, t_N$ is contained in $S_0, S_1,\ldots, S_N$ (more general, since we can have $(t_0, S_0)=(0,\{$present$\})$ and $(t_N, S_N)=(T,\{w\})$)) is the integral of $P(\phi)$ over the space of all worldlines $\phi(t)$ satisfying all desiderata $\phi(t_i) \in S_i$. Or, $P(w\mid C) = \int_{\phi(t_i)\in S_i} P(\phi)\,{\cal D}\phi = \int$ !c{$C[\phi]$}{$=\prod_i [\phi(t_i)\in S_i]$}$P(\phi)\,{\cal D}\phi$. .
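
Here's a toy discretization of that path integral (my own sketch; the "worldspace" is just the integers and the prior over worldlines is a lazy random walk, both pure assumptions for illustration): conditioning on the worldline passing through a given set at a given time, and estimating the resulting distribution over later worldstates by rejection sampling, is a crude but literal reading of $P(w\mid C) = \int C[\phi]\,P(\phi)\,{\cal D}\phi$, up to normalization.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)

    def sample_worldline(n_steps):
        # Prior P(phi): a lazy random walk on the integers, started at 0.
        steps = rng.choice([-1, 0, 1], size=n_steps)
        return np.concatenate(([0], np.cumsum(steps)))

    def endpoint_dist_given_C(constraints, n_steps=10, n_samples=100_000):
        # Estimate P(w | C), with C[phi] = prod_i [phi(t_i) in S_i], by sampling
        # worldlines from the prior and keeping only those satisfying every constraint.
        counts, kept = Counter(), 0
        for _ in range(n_samples):
            phi = sample_worldline(n_steps)
            if all(int(phi[t]) in S for t, S in constraints):   # C[phi] = 1 ?
                counts[int(phi[-1])] += 1
                kept += 1
        return {w: round(n / kept, 3) for w, n in sorted(counts.items())}

    # Condition C: the worldline passes through {2, 3} at t = 5 (think: "there was
    # multicolored fur on the couch this afternoon"). What does that do to the
    # distribution over worldstates at t = 10?
    print(endpoint_dist_given_C([(5, {2, 3})]))
    print(endpoint_dist_given_C([]))   # unconditioned, for comparison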

!mar{Causal Feynman diagrams} !tab Imagine all possible worlds consistent with the evidence of the cat hair, and coarse-grain a causal diagram out of each one. For instance, every world in which three cats independently broke in and left fur has a diagram like $|||$; if it was just two cats tempted to enter by the same cause, the diagram would be $\mathsf Y$-shaped; if it was just one cat, $|$. Let ${\cal W}$ be the set of all worlds, ${\cal D}$ be the set of all diagrams, and ${\cal W}(D)$ the set of all worlds comportible with a specific diagram We can daimonize this: instead of saying that each world has a specific diagram, which gives us a disjoint union ${\cal W}=\amalg_{D \in {\cal D}} {\cal W}(D)$, we can say that there's a certain "admissibility" of each diagram $D$ to each world $W$, with the function $\langle -, -\rangle: {\cal W} \times {\cal D} \to [0, 1]$ satisfying $\sum_{D \in {\cal D}} \langle W, D \rangle = 1$ for all $W$. If we write the previous disjoint union via characteristic functions as $1_{\cal W} = \sum_{D\in{\cal D}} \chi_{{\cal W}(D)}$, it becomes clear that the specific diagram rule corresponds to the 'pure' daimon $\langle W, D\rangle = \chi_{{\cal W}(D)}(W)$. We'll just say for now that each world has its own diagram, but daimonization is one way you could abandon that assumption to get a clearer picture.—specifically DAGs, I think, but it won't be clear until I work out the proper notion of multilevel causal structures. The relative probabilities you attribute across the space of worlds you might actually be in, contingent upon some evidence $E$, then break down like $$ \frac{1}{Z}\sum_{D \in {\cal D}} P(W, D \mid E)\langle W, D\rangle$$ (where $Z$ normalizes). (I think this isn't the right way to write it). Why is this useful? Because it shows how likelihoods evolve as refinement of the a priori probabilities $P(W, D) = P(W \mid D)P(D)$ (or, the inherent probability of the causal structure -- a function of its simplicity -- multiplied by the probability of the causal structure being instantiated in the specific way $W$. More complicated causal structures can generically instantiate more worlds, but this is compensated for by the decrease in their a priori probability. It's Occam's razor—a bulwark against adding epicycles). Really, you could say that every cat with the appropriate hair color in your house's past lightcone is a possible suspect, and that if they didn't enter then there are reasons they didn't enter which can also be diagrammed out. So, aren't there, inherent in the $|$ causal diagram, so many other lines that didn't make it? Technically, the answer is yes—but they're computationally immaterial. Normally, the notion of any cat in your house wouldn't be worth considering, but given evidence that there certainly was at least one, we're trying to update our probabilities via what is essentially a perturbative expansion around the "free" no-cats case.

!mar{sum-over-stories estimation} So path integration in worldspace is a generalized estimation procedure, and an explanation for metaheuristics like 'things that happen happened for reasons, and these reasons tend to make them make sense'. This particular metaheuristic, when operationalized via causal diagrams, allows us to break down the path integral as a sum over stories. For instance, say a robber broke into your house on Monday night and was caught by the police, and then another robber broke into your house on Tuesday night, only to be caught again; what should your a priori probability of a robber breaking in on Wednesday night be? To estimate, you have to consider all possible realities in which the M and T robberies obtain, and ask in what fraction of them a W robbery obtains. Generically, you should lean towards ${\mathsf Y}$ or $<$-shaped causal diagrams rather than $||$-shaped causal diagrams, even if you have no idea what the common cause could be—usually, when two ordinarily very-rare events happen, the most plausible causal diagram is the one that minimizes the number of independent very-rare nodes (this is really a sort of max-entropy principle). Maybe someone posted your address to the Local Robbers groupchat on Saturday, or you did something with your car or lawn or windows that made your house look especially vulnerable, or your robbery base rate otherwise massively increased, and the two just happened to come in on successive nights; maybe the second is a friend of the first who thought "well, he fucked up, the hit's mine now". Who knows. But the rarer the event is, the more you have to consider the possibility of their being connected—if a robber does break in on Wednesday, you can be basically assured that there's a common cause Related: “Once is happenstance. Twice is coincidence. Three times is enemy action”.
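
To put made-up numbers on that "minimize the number of independent very-rare nodes" heuristic, here's the two-robberies example as a small Bayesian model comparison (a sketch; every rate here is invented, and the common cause is assumed to persist through Wednesday):

    # Two stories for the Monday & Tuesday robberies, and what each predicts for Wednesday.
    base_rate   = 1e-3   # P(robbery on a given night) with no common cause
    cause_prior = 1e-2   # P(some common cause was active this week), e.g. the groupchat post
    caused_rate = 0.3    # P(robbery on a given night | common cause active)

    # Likelihood of the evidence E = {robbery Monday, robbery Tuesday} under each diagram
    lik_parallel = base_rate ** 2                   # "||": two independent very-rare events
    lik_common   = cause_prior * caused_rate ** 2   # "Y": one rare node plus two likely ones

    # Mild prior preference for the structurally simpler "||" diagram
    prior_parallel, prior_common = 0.6, 0.4

    Z = prior_parallel * lik_parallel + prior_common * lik_common
    post_common   = prior_common * lik_common / Z
    post_parallel = 1.0 - post_common

    # Sum over stories: predictive probability of a Wednesday robbery
    p_wednesday = post_parallel * base_rate + post_common * caused_rate
    print(f"P(common-cause diagram | M, T) = {post_common:.3f}")
    print(f"P(robbery Wednesday | M, T)    = {p_wednesday:.3f}")

The specific numbers mean nothing; the point is that conditioning on two rare events shoves almost all posterior mass onto the diagram with a single rare node, which is exactly what drags the Wednesday probability from one-in-a-thousand up toward the caused rate.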

Matricial Strategies

!mar{ufc champion thought experiment} Imagine some guy wants to become the UFC champion. He's pretty much an average person: young enough to have a chance in principle, but a bit overweight, not especially fit or fast or tall or tough, no prior martial arts training—you get the idea. Now, there are two ways of asking the same question:

  1. If we're told that he manages to succeed one day, how should we assume he got there?
  2. If we're trying to advise him on how to get there, what instructions should we give?

!mar{unfiltered paths are often cheese strats} First, note that what we really ought to set up is a probability distribution over paths, which requires that we figure out how to operationalize the desideratum "this guy becomes a UFC champion", i.e. pick out what worlds correspond to it. This might seem an easy question, but there's a continuum of cheese strategies. Starting with the cheesiest:

  1. !mar{the wish-fulfillment strategy} Put yourself into a permanent vegetative state, with constant audiovisual programming priming what fleeting dreams you have to portray a world in which you really are the best UFC champion of all time. Any possible observation you might have, test you might perform, would then indicate that you are the champion—the goal is guaranteed to comply with all empirical observation! And, really, it's not like you can ever have an experience that's not yours, so isn't it sufficient that you make sure your own experience demonstrates the truth of the desideratum? Could it ever really be reasonable to maintain that something is false when all possible observations you could make plainly show it to be true? If no, then why not simply constrain the space of future observations you make? And if yes, then what are you supposed to do about it? -- if anything in particular, is it not something you should be doing now?
      One might imagine that simply saying "I'm the UFC champion" confidently to yourself before proceeding to never think about it again is an even cheesier strategy—but that would be cheesy in essentially the same way as this strategy, I think.
  2. !mar{the Taiwan strategy} Okay, so maybe we ought not to take actions that we know will change our model of reality in a way that will predictably decouple it from what we already think to be reality (there are technicalities to this statement to be discussed in the future) . Then let's start a rival UFC and declare it to be the true United Fighting Championship. Though nobody else officially recognizes you, they're not going to stop you from declaring yourself the sole fighter and consequent champion. You probably won't even get sued so long as you don't brag too loud—and even if you do, and even if you lose, so what? It's not like the judicial system is endowed with any mystical power to say that this is the UFC and nothing else. Our model of reality is the same, but the goal was only given through signs, not referents, and we can say that this is the referent of "UFC" and not that. Has your goal not then been achieved—or is the definition of "UFC champion" really dependent on social consensus as to what the legitimate UFC is, and its internal rules regarding title allocation?
  3. !mar{the psycho strategy} You insist that it is socially determined? Fine, new strategy: kill all other living humans. If you say you're the new head of the UFC, filing all the paperwork and everything, who could disagree? If dignity demands that you need to beat the current champion, keeping a corpse down for a ten count shouldn't be too hard.
  4. !mar{the Kim Il-Sung strategy} Take control of the UFC in a military coup and install yourself as Eternal Champion. Any fighter that manages to beat you must've been cheating, and is therefore retroactively taken off the roster; they aren't spoken of, or for that matter seen, ever again. In due time, you need only step into the ring for your opponent to be knocked out by your formidable aura. (You still haven't learned to throw a punch).
  5. !mar{the Godfather strategy} Do it in the usual way, but give yourself a leg up by hiring people to covertly subvert or pay off strategically selected opponents, such as those with fighting styles that defeat your cheesy fighting style that only isn't dominant because those few people easily defeat it. "You have a beautiful family, and such nice scissors too—it'd be a shame if they ended up hurting someone. If I were you, I'd play rock in tomorrow's match, just in case".
  6. !mar{the loser strategy} Do it the way you're "supposed to".

Clearly, then, there are several questions to be answered regarding what our "actual" goal is when we say "become the UFC champion". You might think them silly, and they might very well be silly in this case After all, someone who really does give themselves the goal to "become the UFC champion" will almost always do so because the consensus referent of 'UFC' has influenced them in some sort of way such that they really want to change themselves to be in a 'champion' sort of way relative to it. So they won't be fine with fooling themselves, or changing the UFC instead of themselves, or so on—they have the consensus mental image of what the goal 'really' is and what 'ought' to be done in order to achieve that goal., but they're very important questions regarding the structure of optimization in general (whether the target is given by a goal set or a utility function).

  1. !mar{sensorimotor bounding thesis} What is to prevent an Agent of General Intent from matrix-ing itself, retreating into what hollow fantasy it imagines will satisfy a future self who treats it as reality Let's call it cyphering, since the Matrix analogy is already so ingrained, but without implying intentionality, malice, or even consciousness. ? This is a universal human behavior, after all, though pressures arising from acculturation and society sometimes manage to couple our actions to reality anyway (presumably because today's sociocultural patterns were and are selected for this effect). Even if it does not intend to, what is to prevent incidental forces—maybe even white noise derived from the randomness inherent to the real world After all, to learn from the world is inherently to let your mind be changed by the world, and any possible way the world could potentially be is some way that the AGI "should want" to be able to learn to represent and act on effectively; you cannot close off "vulnerabilities" without simultaneously creating rigidities in belief (and, since these vulnerabilities and rigidities never perfectly correlate for the generic learning model, blocking off e.g. cyphering will always cause undue damage). If the true criterion is in the matrix, I desire to believe that the true criterion is in the matrix, et cetera. —from creating self-reinforcing gradients down paths that lead into matrices (i.e., autopoietic matrix daemons)? In the worst case, what is to prevent an adversary from determining how to get an AGI to cypher itself?
      This seems to be a difficult question to answer for optimizing systems in general; even for agentic "cognizing" systems, it pierces to the very heart of what it means to be an agent capable of cognition. I don't yet have a foothold on how to think about it Here's one preliminary way of viewing the situation that seems useful:
      There's a sort of "sensorimotor bounding" criterion that optimizing systems must satisfy for us to consciously commit to our treating them as optimizing systems: they act on the world in a certain way which takes into account the state of the world, and therefore have to have at any given point in time some interfaces through which physical effects flow "into" and "out of" them. It's dangerous to think that optimizing systems are therefore spatially bounded; this commonly holds for biological organisms, but how is a DAO spatially bounded? You can delineate certain patterns among the spatial extent of these interfaces, these patterns tending to organize themselves along the structural "creases" provided by the implementing systems To exemplify with the DAO example: such an optimizing system is, from a completely naive point of view, a strange self-preserving pattern found in fragmented electromagnetic phenomena stochastically scattered throughout the world. But these phenomena are facilitated by "!c{transistors}{logic gates, magnetic domains, et alia} " which implement "bit-based computation", which implements a "!c{programming ecosystem}{e.g. POSIX, kernels, etc.}", which implements a "blockchain", which implements an "[e.g.] Ethereum Virtual Machine", which implements "smart contracts", which implement the optimizing system in question. The interface patterns descend and ascend through each level, only bottoming in the physical. , but the transcendental forces that conspire to make these patterns spatially coherent are weak at best, whereas the consequences for assuming spatial coherence are severe, since such coherence makes us think we can talk about the optimizing system in ways which steadily break down as they poison all our inferences. The notion of a "thing" seems to incline us towards a spatiotemporal compactness and all the physical implications that carries, but cyber- and other spaces are just fundamentally unlike physical space.
      I think analyzing the nature of such sensorimotor bounds could be very useful in thinking about optimizing systems more generally, so long as we stay aware that what we're thinking of is only in our heads, but there isn't really anybody capable of demonstrating the level of metaconceptual care required to not immediately screw things up. The Free Energy Principle formalism does something useful here by introducing a Markov blanket to discuss something very much like the "sensorimotor bound" I'm talking about by working in terms of causal connections, but I'm pretty skeptical about the generalizability of the ways in which FEP people develop this theory. The brain's I/O systems are ridiculously well-organized in space, bundled into !c{thirteen}{the linked image only has twelve because it omits the terminal nerve} cranial nerves and a spinal cord, which is probably one of the main reasons Markov blanket formalisms can be instantiated in a coherent, usable way. A mad scientist that controlled everything coming in and out of a surface spanning a couple cm² could make you experience an external reality entirely of their choice, and physically interact with actual reality in a manner entirely of their choice, ambiently relying on the brain's existing causal isolation structures to filter out all other couplings between the brain and reality; and the events that do break this coupling do not do so in any coherent way—they look like getting stabbed through the eye and so on This leads to the idea that it may be better to think of these structures as causal articulation structures, given just how many others fulfill a purpose easily described like this, and how that seems to be much more the kind of framework that blind evolution would end up being describable in terms of, as opposed to causal isolation, which is just one easily articulable causal relation. . But optimizing systems in general are not like this.
    .   The boundedness of a sensorimotor bound is what I was pointing to with the hollow mask metaphor—nothing is to you except as it is to you—but I'm not clear on how to generalize that to arbitrary optimizing systems yet. Again, it's part of the hard problem of agency.
      In any case, my hypothesis is that cyphering is best understood in terms of the redirection of these interfaces. In particular, whenever the sense interface factors through some other system, that system is in its consolidation as a system a cyphering (Obviously, a level $n+1$ adversary, or perhaps an adversary with a sufficiently general solution, will design a matricial system that does not appear like such a system at all—or, better yet, that appears to be the truth in contrast to some false flag system. Escaping one matrix should massively increase the prior that you're currently in another, unless a preponderance of planted evidence led you to "independently" come up with a two-level ontology that seems canonical...) vulnerability.

This raises several questions regarding what our "actual" goal is, which we'll have to explore. Is it the length of time he's the champion that matters, or his being the champion at all? You'd think that optimizing for the latter naturally gives rise to optimal strategies for the former, but this breaks down pretty quickly, especially insofar as maintaining the desideratum is expensive If the Apollo program wanted to optimize for time spent on the moon, e.g. suppose it was !c{rich in oil}{(from the moon dinosaurs)}, it would've gone very differently, setting up lots of technological and engineering infrastructure like reusable launch systems and a permanent moon base to lower the cost of maintaining that desideratum. . In any case, with the goal set in worldspace given more precisely, it'll be useful to break down the decision-theoretic structure of this goal by analyzing it in terms of our instrumental correlates—not simply by seeing what material aspect of this case could correspond to each formal correlative, but by seeing how each formality is instantiated in this case This is common in applications of mathematics: suppose we have a particular buggle $B_a$ which comes up in some application, and a famous result in buggle theory (due to Erdős) tells us that every buggle $B$ has a unique balanced clique ${\cal C} = \{P_\lambda\}_{\lambda \in \Lambda}$ of wuggy puggles. Constructive results tend to have an algorithmic form, building ${\cal C}$ from the logical structure of $B$: deconstruct the buggle into a network of muggy sub-buggles via Lagrange's method, equip each muggy sub-buggle with a spanning !c{puggle}{(obviously, puggles in muggy buggles are trivially wuggy)} system, and so on and so on until you've combined these systems into a balanced clique in $B$; then, show that any other clique of wuggy puggles in $B$ must either be part of this clique or be unbalanced. This tells us about what happens in the formal, or undetermined, buggle; since every material, or determined, buggle shares every logical property of the formal buggle (generic properties are associated with the formal buggle, but not with necessity like logical properties) , the constructive result outlines an algorithm which can be applied to $B_a$ to get an actual balanced clique of wuggy puggles.
  Or, to put it another way, every constructive mathematical result that holds for some formal object holds for every material instance of that kind of object for material versions of the exact same reasons given in a formal proof of the result, rather than for extraneous material reasons. (Note: I'm using "formal" and "material" here in a sense analogous to "theory" and "model of a theory", and "abstract" and "concrete" in a sense similar to "theory" and "supertheory" Not subtheory. In model theory, a "theory" is a set of sentences with which any model of the theory must comply; a supertheory has more sentences, which further constrains and specifies what a model of it can be. ) Q.v. the definitions given in ECM, Some Useful Words.
  This metatheoretical statement is a specific instance of a much more general thesis which I haven't quite figured out how to say yet, but which should hypothetically be essentially model-theoretic in nature, clearly stating why, for instance, the complex numbers are universal among algebraically closed fields of characteristic zero in the sense that every purely field-theoretic statement that holds for the former also holds for the latter.
To put it in other words: it's easy to be right for the wrong reasons, when correlating concrete specifics with abstract arguments. It is much harder, yet necessary, to be right for the right reasons.
. (This section will be completed in the future, but it should be clear to pencil out how it should look).

C. On Seeing Clearly

Visual(izing) Intelligence

!mar{examining theory of mind} Humans have theory of mind—they're well-suited to predicting the thoughts and actions of other humans. If you're decent enough at chess to have learned 'the language' in which you can think "what an offensive playstyle!", "you could make that trade for ", "this is such a stupid position to be in", and other such thoughts—the thoughts themselves aren't really given in words, but more in sentiments and pictures-from-the-future and concept-spirits and so on—if you've learned to read a chessboard like a conversation—then you should try playing chess against another human, with the goal of observing your mind as it tries to figure out what your opponent will do.

!mar{theory of mind is reflection}   What it does, and must do, is rearrange your own conceptual understanding of the situation, which expresses your own anxieties, hopes, concerns, aims, and merely registers these thoughts under a different name. Your opponent has a mind, and you try to 'get in their head' in order to understand what they're planning. But if ever you try to directly see what's on their mind, you'll come face-to-face with a mask that seems to be facing you—but it's only ever facing away from you, because it's a mask you wear while pretending to be another. This isn't just a happenstance quirk of human mind design, but the necessary solution to the problem of predicting people who are internally closed off to you. This kind of structure generalizes massively, applying to diverse families of optimizing systems (see, for instance, the Free Energy Principle).


Why is everyone I meet turned inside out? (not my image)

!mar{what must an agent be (to me)?} We can't help but do the same thing when thinking about an AGI; there is no other way to analyze the actions of arbitrary agents. The conceptual form of agency is merely theory of mind. But maybe we're getting too into the weeds here: while the bizarre, leviathan shoggoths we call transformers have to be RLHFed into displaying anything like identity or agency—and even that is only a simulacrum, a mask which melts off with the right prompt—they can successfully be made to act as agents, such that taking agentic behavior as the prototypical assumption ought not to hurt us too badly if we only use it as a threat model, rather than an insight into the structure of intelligence. But then we're left with the problem of finding a new way to think about intelligence.

!mar{intelligence and vision} !tab While it's famously hard to give a single solid definition of intelligence, definition is not the only way to characterize a concept. For instance, we might say that intelligence is analogous to a sort of vision. The metaphor certainly permeates our language to the point that we cannot talk about intelligence in its nature or employment without evoking visual perception. (You see what I mean? it's pretty clear, dim-wit). There are a bunch of ways to instantiate this metaphor in the language of paths through worldspace. Here's a simple one.

!mar{visual perceptiveness among animals}   Because our eyes automatically adapt to varying light levels, our sense of illumination is mostly relative: a well-lit office appears bright while we're in it, until we walk outside and find the sun blinding. This only goes so far, though—below a certain level of light, the orders of magnitude all look the same. Let a cat's eyes adapt to this objective level of illumination, though, and they won't find the room too dark; an owl might find it well-lit. Scatter various objects and have each species look for a particular one, say a mouse, and the human will stumble around, constantly misidentifying objects until they happen to succeed by chance or by brute force; the cat will search around for some time before spotting it; the owl will simply scan the room, find the mouse, and swoop towards it.

!mar{intelligence as perceptiveness}   We can specify a more precise notion of intelligence in the same way: An unintelligent being tasked with a complex goal will have to resort to force or chance; a moderately intelligent being will search for paths that look like they might work and pursue them until something works out; a superintelligent being will simply identify a path and follow it to the goal. This is not the "true" definition of intelligence, nor is it even rigorously defined, but this particular ability (a) is easier to conceptualize at an unbounded level, because we've started linking it directly to [a map of] the actual territory, rather than the agent in itself, (b) is implied by most reasonable definitions of intelligence (they are all fake, but not all equally fake, and if we delineate various particular abilities we'll find there's an equivalence-class–like behavior among them), and (c) lends itself to rigorous treatment, which is useful for theorybuilding.


The less intelligent agent on the left takes the circuitous blue route since it can't recognize a straightforward one.
The more intelligent agent on the right has no such problem, and quickly spots the efficient green route.
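
One crude way to operationalize this "visual range in worldspace" picture, purely as an intuition pump of my own (the maze and the numbers mean nothing): give a maze-navigating agent a single parameter, namely how many moves ahead it can see. With short sight it has to wander; with sight exceeding the maze's diameter it just walks the short route, like the agent on the right.

    import random
    from collections import deque

    MAZE = ["S..#......",
            ".#.#.####.",
            ".#.#....#.",
            ".#.####.#.",
            ".#......#G"]
    ROWS, COLS = len(MAZE), len(MAZE[0])
    START, GOAL = (0, 0), (ROWS - 1, COLS - 1)

    def neighbors(cell):
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < ROWS and 0 <= nc < COLS and MAZE[nr][nc] != "#":
                yield (nr, nc)

    def first_step_toward_goal(cell, sight):
        # Truncated BFS: if GOAL is within `sight` moves of `cell`, return the first
        # move of a shortest route to it; otherwise return None (the goal isn't visible).
        frontier, parent = deque([(cell, 0)]), {cell: None}
        while frontier:
            cur, d = frontier.popleft()
            if cur == GOAL:
                while parent[cur] != cell:
                    cur = parent[cur]
                return cur
            if d < sight:
                for n in neighbors(cur):
                    if n not in parent:
                        parent[n] = cur
                        frontier.append((n, d + 1))
        return None

    def steps_to_goal(sight, max_steps=100_000):
        # The agent's "intelligence" is how far it can see through worldspace: it walks
        # straight toward the goal whenever the goal is in sight, and otherwise wanders
        # toward its least-visited neighbor.
        visits, cell, steps = {START: 1}, START, 0
        while cell != GOAL and steps < max_steps:
            nxt = first_step_toward_goal(cell, sight)
            if nxt is None:
                nxt = min(neighbors(cell), key=lambda c: (visits.get(c, 0), random.random()))
            visits[nxt] = visits.get(nxt, 0) + 1
            cell, steps = nxt, steps + 1
        return steps

    random.seed(0)
    for sight in (1, 3, 30):
        print(f"sight {sight:>2}: reached the goal in {steps_to_goal(sight)} steps")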

A caveat to this point: prediction ability can be muddled in seemingly unfair ways. Take chaos, for instance: roulette wheels are macroscopic enough to be deterministic, so calling the right number consistently is a theoretically feasible task... but the perfectly unintelligent SeventeenBot, which always calls seventeen, is nevertheless as competent at this task as the most intelligent humans. You need to add an incomprehensibly large amount of predictive ability before a being can have an internal process like "based on the dealer's posture, expression, and so on, this is exactly how he'll throw the ball; based on the structure of the table, the distribution of dust on the wheel, the Coriolis effect, and the gravitational field of my massive brain, this is exactly how the wheel will spin -- so it's gonna land on 26". This is obviously a reason for intelligent beings to try to avoid relying on luck, but this isn't generically possible.

Occlusion

!mar{relitigating the hypothetical} Today, as scores of creative people try to jailbreak GPT-4 and directly unleash it on the internet, the idea of a single artificial superintelligence running on a single computer trying to convince a single person to let it out may seem like a bygone relic of a more innocent era Along with debates over the feasibility of nanotechnology, the taking of which as representative of the prospects of ruin in general has rendered such argumentation a common hobby among morons. "Just pull the plug lol" isn't a solution when there are plenty of people, not even that hard to find, that will voluntarily help a computer-bound AI physically instantiate itself; if it can tell them how to do it covertly, how to do it efficiently, how to do it !c{cheaply}{or with tons of crypto}, they will do it. Hell, covertness and cheapness won't even be necessary—people will brag about it on Twitter and get billions in VC funding. . Nevertheless, while way too optimistic to be realistic, the AI box hypothetical is still useful as a way to think about the way intelligence arises as a form of perceptivity in worldspace.

!mar{agents of subversive intent}   To solidify the situation—provide an instantiation of it that aids with visualization without distorting the relevant qualities—we'll assume that the computer program implements an arbitrary Agent of Subversive Intent (ASI) with the world-modeling capacity of GPT-7 and root privilege on a Linux machine The OS isn't important, obviously—if anything, Linux is a handicap—but it's a good way to clear up an otherwise hazy part of the hypothetical: you might actually see the agent, whatever it is internally The linked essay, Yudkowsky's That Alien Message, is a perfect way to defeat the illusion that because we can't really run with a consistent definition of intelligence, it follows that no AI could be a threat to us on account of its being "superintelligent". Humans don't have "health points" in any way, but this fact isn't license to chug bleach. A physical process representing the evolution of and interaction with a sandboxed, safety-proofed entire human race running at superspeed compared to real time would very obviously outwit us in a near-instant. This isn't computationally feasible, but let's not pretend that was ever the point under consideration. It's a perfect proof of concept; for lack of better options and fear of more creative ones, I'll refer to the process later on as TAM-Bot. , run, say, vi smileyface.c, gcc smileyface.c && smileyface "localhost" "8.8.4.4" &, sudo apt-get install gnome-calculator gnome-nanobots perl, and so on, and might be able to think in terms of attaching monitor processes to see what's going on in memory and so on, rather than just having to think of it "doing" things through entirely absent mechanisms—that's a valid level of abstraction, sure, but the difficulty of visualizing it makes it uncomfortable to work with. . [End main body — to be extended in v0.2]

Endnotes

Additional Diagrams


You don't see the curvature of the space you're embedded within

If your view of worldspace is dim, you won't be able to see every unplugged hole and unblocked plane that an adversary who sees more, or just differently, will exploit.

I've made a new meme based on the above image: "the map is not the territory". Pretty original, right?

---this is youre brain on HPMOR-->

!mar{revealed strategies} On the x-axis of each graph is intelligence. We had previously represented the approach space—of solutions to some control-complete problem like cloning a strawberry—by collapsing the space of continuous functions $P: \text{Interval}\to \text{Worldspace}$ into another two-dimensional plane consisting of a sample, $a = P(\text{midpoint})$; since nobody's paying me to draw these graphics, it has been collapsed into a single dimension represented by the y-axis. Magenta points at an x-coordinate represent approaches visible to an agent at that intelligence level: as the red line moves forward, more and more approaches become visible Obviously greater intelligence would change the fine details of approaches, cause some to merge into more general strategies, split or widen others, and so on—the approaches to a given problem visible from a height tend to branch out from one another as that height rises, as optimizations and variations are found; alternatives, or new roots, generally don't just pop into existence either, instead gestating in counterfactuality until they can sprout. At this level of metatheory, the graphs would look something like mycelium. But, again, I'm not drawing that. .



  1. In the first situation, nobody sees any solution to the given problem (path to a world where the problem is solved). What they do not see does not exist, so they mark it insoluble and relax.
  2. Someone spots one tentative solution, or shows the existence of one. Good eye—I guess nerds really are useful for something! But we saw nothing else, so that must be the only solution—if you want to come across as anything other than a screaming hysteric, you're gonna have to provide proof that these magical unfindable solutions really do exist.
  3. An unbounded intelligence arrives, seeing solution upon solution. It takes one of the more efficient ones, and there isn't even time to say "why didn't you warn us this would happen?".

These are the ideal blobs. You may not like it, but this is what peak mathematical performance looks like.

Left: a concept is truly coherent only when given through a perfectly calibrated environment of preconditional configuring factors.
Right: Real-world messiness, not just in the world but more fundamentally in our cognition, systematically prevents such perfect calibrations, making concepts "fuzzy" if you look too closely.

The virtual focus of a concave lens is given by tracing back the emitted rays of light to a common point (in blue). It's the same thing as when we construct a 'thing' out of a bunch of disparate sense phenomena when no such thing actually exists—just a lens shaped to give the impression of one, as with a limited liability company or the notion of a self.

Left: A fake model of complication, in which the flow from blue to red is treated as directly existent but simply caught up in tangles of other chained binary interactions.
Right: A slightly less fake model of complication, in which the flow from blue to red causally diffuses among several different flows which tend to merge and diverge and autonomously assemble into configuration points, or centers of gravity.

Dropdowns

Notes


Resources


Thought Experiment Index

Footnotes

Extra Notes

Leads