Abstraction's End

(Published June 9, 2023 by Mariven)

Thesis

The idea of probability is one we commonly rely on in order to talk about our world models, especially in communities where their interpretation as credences, or subjective degrees of belief, is widespread. More relevant to artificial intelligence is the manner in which we implicitly bake probability into our theories of decision and utility, justifying it post-hoc with e.g. the VNM theorem or complete class theorem. But the axioms consistently fail to apply; no matter how hard we grasp, reality always seems to slip away from our models of it. Here, I compile a lot of resources that demonstrate why probability is difficult. Some of the questions I want to point to, and list resources for figuring out answers to, are taken up section by section below.

Context

A year and a half ago, I wrote Probability is Difficult, a review of the foundations of probability theory: the interpretations of various intuitive notions of probability, the axiomatic systems through which we use them in mathematical applications, and the various pain points and paradoxes that we have to watch out for if we want to be not just consistent but correct in our use of probabilistic reasoning. The core notion I wanted to impart is in the title: probability is difficult. It's not so simple as setting up numbers and playing with them—you have to couple those numbers consistently to reality, and this is so incredibly hard to do correctly. While the essay was little more than a thorough compilation of existing work, writing it was very educational, and I figured it would be worthwhile to compose a LessWrong version tailored to this site's idioms and intents.

In the course of cleaning it up, I found that I could do something much better: make it a guide on the proper and improper use of probability in general. Not just from inside the mathematical perspective—what kinds of mathematics don't break down due to their own logic—but from outside it, the place where you're deciding what mathematics to use and how to use it. Not just how to interpret the use of probability, but how to employ it: our theories of decision, utility, and learning fundamentally depend on probabilistic reasoning, both at the object-level (as when we speak of maximizing expected utility) and at the meta-level (as when we argue about the Solomonoff prior), and this has motivations and consequences; we ought to be aware of when and why it breaks down, which it does surprisingly often.

Unfortunately, I realized after starting to work on this updated version that to sketch this full picture to any adequate level of detail would take perhaps two months of work, which is more time than I can afford to spend. So, all I can do for now is try to outline how someone who wanted to understand exactly why, how, and where probability is difficult would go about doing so, by listing various resources, key terms and areas of study, and lines of thought being pursued. I am trying to list resources that, whether by saying worthwhile things or pointing to other worthwhile places, are useful for structuring one's understanding of a given topic.

The main resources I've compiled are given as fifty-something bolded links weaved throughout exposition. (Were it just a plain list, people would go "wow! cool!", bookmark it, and never read it again; I want to give you some idea of the underlying narrative which makes them important to understanding why probability is difficult, so as to actually motivate them).

I'll assume that you already have some knowledge of probability theory, and know the basics about frequentism, the Kolmogorov axioms, Dutch books and Bayesian inference, and so on. If you don't, then, again, the original version of Probability is Difficult is a great place to start.

Interpreting Probability

What Should "Probability" Mean?

Interpretations of Probability

There are many ways to interpret the idea of 'a probability', the most common of which are:

  1. Probabilities are proportions of the space of possibilities which produce a given measured outcome (e.g., two rolled dice summing to five has probability $\frac{|\{(4,1), (3,2), (2,3), (1,4)\}|}{|\{(1,1), ..., (6,6)\}|} = \frac19$).
  2. Probabilities are objective properties of reality describing the propensities of possible measured outcomes;
  3. Probabilities are objective properties of reality describing the (actual vs limiting) frequencies at which a certain (real vs ideal) experimental apparatus produces a certain outcome;
  4. Probabilities are cognitive constructs (subjectively vs objectively) constructed to convey an agent's subjective credence, or degree of belief, in a certain outcome.

These are respectively known as the classical, propensity, (finite vs hypothetical) frequency, and (subjective vs objective) Bayesian interpretations. Note that the terms 'subjective' and 'objective' are used in two different ways here: the frequentist and propensity interpretations are objective, where the Bayesian is subjective, in that they treat probabilities as internal to the object, i.e. the world, rather than to the subject, the one reasoning about the world. Bayesianism's subjective-vs-objective split is about the extent to which the reasoner is forced to construct their internal probabilities in a single correct way based on their prior knowledge. (The update from prior to posterior given some data is always the same, though).
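
As a quick concrete check of the classical computation in item 1, here is a minimal Python sketch (my own illustration, not part of the original exposition) that enumerates the 36 equally likely outcomes and counts those summing to five:

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))
favorable = [o for o in outcomes if sum(o) == 5]

# Classical probability: favorable possibilities / total possibilities.
p = Fraction(len(favorable), len(outcomes))
print(favorable)  # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(p)          # 1/9
```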

A very quick and marginally illustrated overview of the main interpretations of probability is given by Nate Soares' Correspondence visualizations for different interpretations of "probability", which fits in a trilogy of tinyposts about probability interpretation.

A much larger, more thorough discussion is given by the Stanford Encyclopedia of Philosophy's entry on interpretations of probability. As expected, though, it's largely embedded in the explicitly philosophical literature, filled with citations so as to mention every philosopher's opinion no matter how stupid or irrelevant. Still, it's a pretty good place to get your bearings. An even better discussion is given in the excellent review Probability is Difficult. (The author remains pseudonymous, but the stunning combination of philosophical deftness and technical expertise speaks volumes as to the author's erudition, lucidity, and, above all, humility).

If one wants a textbook-length explanation from a philosophical standpoint, there's Donald Gillies' Philosophical Theories of Probability, which also attempts to answer the question of when and where we should use this or that interpretation—see e.g. Ch. 9's "General arguments for interpreting probabilities in economics as epistemological rather than objective".

There are other ways to slice up the subject beyond these four interpretations: see e.g. the Wikipedia page on aleatoric and epistemic uncertainty; these correspond roughly to the frequentist and Bayesian notions, but take a slightly different point of view that keeps the correspondence from being a clean one. Just the same, there are notions that fall neatly within these interpretations and yet should not be confused with 'probability'. Likelihood is one such notion; its difference from probability is explained by Abram Demski's Probability vs Likelihood, as well as the Wikipedia page on likelihood functions.
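
To make the distinction concrete, here is a small Python sketch (my own illustration, not Demski's): for a binomial model, the probability function sums to one over the possible data, while the likelihood function, read as a function of the parameter with the data held fixed, need not integrate to one and is not a probability distribution over the parameter.

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(k heads | n flips, bias theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

n, k_obs = 10, 7

# Probability: vary the data k with the parameter fixed -- sums to 1.
print(sum(binom_pmf(k, n, 0.5) for k in range(n + 1)))  # 1.0 (up to float error)

# Likelihood: vary the parameter theta with the data fixed -- need not
# integrate to 1 (crude Riemann sum over theta in (0, 1)).
grid = [i / 1000 for i in range(1, 1000)]
print(sum(binom_pmf(k_obs, n, t) for t in grid) / 1000)  # ~0.0909, not 1
```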

Subjective Probability

If you want to know how to think like a Bayesian, look no further than the bibliography of E. T. Jaynes. The father of both modern Bayesianism and the acerbic overconfidence of its proponents, his textbook Probability Theory: The Logic of Science builds the entire subject of probability theory as a connected edifice of techniques, heuristics, and formulas for accurately modifying and applying your beliefs about the real world (link is to a pdf; see this LW post of the same name for a much shorter review/walkthrough).

Two more conceptually oriented texts on subjective reasoning by notorious Bayesians are Bruno de Finetti's Theory of Probability and I. J. Good's Good Thinking: The Foundations of Probability and Its Applications; these look more carefully at where probability and its logic fundamentally come from, and the latter in particular takes care to demarcate questions of probability from those of statistics, utility, and decision. Chapter 3 of the latter is a (two-page) article entitled 46,656 Varieties of Bayesians, which—as you might have instinctively guessed from the number—is a combinatorial division of kinds of Bayesianism based on their answers to several different questions.

Jaynes spent most of his career dunking on frequentists, and occasionally this produced useful observations on the Bayesian philosophy and approach. (Jaynes is also the one first responsible for giving Bayesianism the naming problem it currently has: in its smallest form, it just speaks about how to update your credences given priors and data; in its largest form, it's a life philosophy which ought to constrain cognition, or govern reality, in full generality. This causes an undue amount of confusion, but see e.g. Kaj Sotala's What is Bayesianism? for some of the tenets of the LW-style mode of reasoning, or nostalgebraist's what is bayesianism? we (i) just don't know for a useful distinction between "synchronic" Bayesianism (the axiomatically true application of Bayes' law to probability estimates) and "diachronic" Bayesianism (a statement that beliefs ought to be represented as probabilities that update via the conditionalization rule), and for a discussion of where diachronic Bayesianism misleads or outright fails people.) His What's Wrong With Bayesian Methods? outlines a key conceptual distinction: when Bayesians speak of the "distribution" of a parameter to be estimated, it is not to imply that the parameter itself is being treated as nondeterministic, or that we're speaking of its relative likelihoods of taking on certain values across a wide range of situations, but simply that we're trying to estimate the value that the parameter actually has in any particular case. In other words, Bayesian reasoning is generally conducted in an "estimation" scenario, not a "deduction" scenario.
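
To illustrate the "estimation, not deduction" point in miniature, here is a hedged Python sketch (mine, not Jaynes's; the uniform Beta(1, 1) prior is an arbitrary stand-in, not a recommendation): the coin's bias is a single fixed but unknown number, and the Beta distribution below describes our state of knowledge about it, not any randomness in the bias itself.

```python
# Conjugate Beta-Binomial updating: a posterior over a *fixed* bias theta.
alpha, beta = 1.0, 1.0      # Beta(1, 1) prior, i.e. uniform over [0, 1]

# Suppose we observe 7 heads and 3 tails.
heads, tails = 7, 3
alpha += heads
beta += tails

posterior_mean = alpha / (alpha + beta)
print(posterior_mean)        # 0.666..., our current estimate of the one true bias
```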

The Origin of Priors

Bayesianism's biggest, most notable problem—whence priors?—is also the source of its largest split, between objective Bayesians and subjective Bayesians. Again, note the distinction between objective Bayesianism and objective probability interpretations such as frequentism. Jaynes, an objective Bayesian, describes what the term means in his paper Probability in Quantum Theory:

Our probabilities and the entropies based on them are indeed "subjective" in the sense that they represent human information; if they did not, they could not serve their purpose. But they are completely "objective" in the sense that they are determined by the information specified, independently of anybody's personality, opinions, or hopes. It is "objectivity" in this sense that we need if information is ever to be a sound basis for new theoretical developments in science.

And again in his Prior Probabilities and Transformation Groups:

A prior probability assignment not based on frequencies is necessarily "subjective" in the sense that it describes a state of knowledge, rather than anything which could be measured directly in an experiment. But if our methods are to have any relevance to science, the prior distribution must be completely "objective" in the sense that it is independent of the personality of the user; i.e., it should describe the prior information, and not anybody's personal feelings.

The original objective Bayesians applied the principle of indifference to get priors: when you want to estimate some parameter in some range, start with a prior that is constant over the parameter space, giving equal weight to all options. This was famously called into question by Bertrand's paradox, which showed that slightly different ways of constructing the exact same parameter space can give different uniform priors, as well as by the similar wine/water paradox, which makes the same point in an even more insoluble manner. Jaynes's paper The Well-Posed Problem attempts to save the principle of indifference by showing that there is in fact a single unique way to be indifferent to the parameter in Bertrand's paradox; he nevertheless admits that his transformation group approach cannot solve the wine/water paradox. (And his solution for the former is problematic as well: as explained by Alon Drory's Failure and Uses of Jaynes' Principle of Transformation Groups, the method by which Jaynes supposedly found a single canonical solution can be adjusted to find each of the other two solutions as well).
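
Bertrand's paradox is easy to reproduce numerically. The Monte Carlo sketch below (an illustration of mine, not taken from Jaynes's paper) draws "uniformly random" chords of a unit circle in three natural ways and estimates, for each, the chance that the chord is longer than the side of the inscribed equilateral triangle; the three equally "indifferent" procedures give three different answers.

```python
import random, math

R = 1.0
L = math.sqrt(3) * R          # side length of the inscribed equilateral triangle
N = 100_000

def endpoints():              # method 1: two uniform points on the circle
    a, b = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    return 2 * R * abs(math.sin((a - b) / 2))

def radial_midpoint():        # method 2: uniform midpoint along a random radius
    d = random.uniform(0, R)
    return 2 * math.sqrt(R**2 - d**2)

def area_midpoint():          # method 3: uniform midpoint inside the disk
    d = R * math.sqrt(random.random())   # distance of the midpoint from the center
    return 2 * math.sqrt(R**2 - d**2)

for name, chord in [("endpoints", endpoints),
                    ("radius", radial_midpoint),
                    ("midpoint", area_midpoint)]:
    p = sum(chord() > L for _ in range(N)) / N
    print(name, round(p, 3))  # ~0.333, ~0.5, ~0.25 respectively
```

The three estimates converge to 1/3, 1/2, and 1/4: the "uniform prior" depends on which parametrization of "a random chord" you declared yourself indifferent over.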

The modern, high-tech version of the principle of indifference is the Jeffreys prior, which is proportional to the square root of the determinant of the Fisher information matrix; it is invariant under reparametrization, and thereby manages to avoid bias in the wine/water paradox. In practice, though, Jeffreys priors often turn out to be non-normalizable, as in most of the examples on the Wikipedia page. (This doesn't dissuade everyone: see Andy Jones's Improper Priors for a tutorial on how they can still be used to derive proper posteriors).
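
As a worked instance (standard textbook material, not drawn from the linked posts): the Jeffreys prior is $p(\theta) \propto \sqrt{\det I(\theta)}$, which is proper for a Bernoulli parameter but improper for a location parameter:

$$\text{Bernoulli}(\theta):\quad I(\theta) = \frac{1}{\theta(1-\theta)} \;\Rightarrow\; p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2} \sim \mathrm{Beta}\!\left(\tfrac12, \tfrac12\right) \quad \text{(proper)};$$

$$\text{Normal mean } \mu,\ \sigma \text{ known}:\quad I(\mu) = \frac{1}{\sigma^2} \;\Rightarrow\; p(\mu) \propto 1 \text{ on } \mathbb{R} \quad \text{(improper)}.$$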

Another attempt by Jaynes to construct objective priors for Bayesian analysis is the principle of maximum entropy, or MaxEnt. His paper Notes on Present Status and Future Prospects talks about the role played by the Maximum Entropy principle in objective Bayesianism, and, more generally, discusses why it's important to reason about states of knowledge rather than the world directly. Most urgently, see his Where do we go from here? for an account of a rap battle concerning the objectivity of MaxEnt.
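
For concreteness (a standard result, not specific to the papers above): maximizing entropy subject to expectation constraints always yields an exponential-family form; for instance, among distributions on $[0,\infty)$ with known mean $\mu$, the entropy maximizer is the exponential distribution:

$$\max_{p}\; -\!\int_0^\infty p(x)\log p(x)\,dx \quad\text{s.t.}\quad \int_0^\infty p(x)\,dx = 1,\;\; \int_0^\infty x\,p(x)\,dx = \mu \quad\Longrightarrow\quad p(x) = \frac{1}{\mu}\,e^{-x/\mu}.$$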

For a much larger exposition on the problems with objective Bayesianism, see Elliott Sober's Bayesianism: Its Scope and Limits.

Objective Probability

Frequentism

If you want to see how frequentists think, most standard statistics textbooks should do. Owing to the dominant role played by frequentists in the foundations of statistical inference, and the fact that the foundations of probability are usually taught only as a sidenote in most statistics classes, people hardly learn to work with probability except in the frequentist context: you have some experiment and some outcome, and you want to pick some test statistic (a function that computes the 'extremeness' of some aspect of the data) in order to establish some sort of confidence interval or reject some null hypothesis with some low p-value, i.e. the probability that the null hypothesis would give data with a test statistic at least as extreme as the observed one. (This is the correct definition, where probability is interpreted as limiting frequency. The fact that scientists are routinely ignorant of it makes for a criticism not of frequentism but at most of contemporary frequentist pedagogy.)
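
As a minimal illustration of that definition (my own sketch, with hypothetical data): testing a coin for fairness with the number of heads as the test statistic, the one-sided p-value is just the null probability of an outcome at least as extreme as the one observed.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, heads = 100, 61          # observed data (hypothetical)
p_null = 0.5                # null hypothesis: the coin is fair

# One-sided p-value: probability, under the null, of a test statistic
# (number of heads) at least as extreme as the one observed.
p_value = sum(binom_pmf(k, n, p_null) for k in range(heads, n + 1))
print(round(p_value, 4))    # ~0.018
```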

Foundational issues aren't really discussed all that much by frequentists; the biggest split among them is not between finite and hypothetical frequentists, but between Fisherian significance testing and Neyman-Pearsonian hypothesis testing (for which see my sequel Statistics is Difficult). In fact, Alan Hajek, in his Fifteen Arguments Against Finite Frequentism, calls finite frequentism "about as close to being refuted as a serious philosophical position ever gets". (You would wonder, then, why he'd bother arguing against it—his Fifteen Arguments Against Hypothetical Frequentism informs us that he just wrote thirty arguments against frequentism in general, only bisecting this list in order to meet a journal's length requirements.)

Propensity

It seems to me that the propensity interpretation is primarily discussed by philosophers, so there can't be much of use there. Again, my original article goes over what little there is to discuss. However, I'll take this opportunity to discuss an objective notion of probability that behaves more or less like propensity: quantum-mechanical probability.

I haven't actually seen much discussion on the role that quantum mechanics should play in our understanding of probability—most discussion goes the other way around. When we measure the spin of an electron in a Stern-Gerlach experiment, the outcome seems to be random, with each spin (up or down) being equally likely for each electron. The canonical quantum formalism tells us that this does not come from a real-valued probability distribution over the space of possibilities, but from a complex wavefunction over this space, with interference of complex phases forming much of quantum's "weirdness". (More generally, we need a density operator rather than a wavefunction. There are actually ways to replace this operator with a real-valued function that captures the same information, and this function acts like a probability distribution, but it generally takes on negative values. Possibilities need to be able to interfere, whether by having different complex phases or different signs (c.f. the Elitzur-Vaidman bomb tester).)

This wavefunction is the "square root" of a probability distribution, in a sense made precise by the Born rule: measuring an observable $A$ with spectrum $\{A|x_\lambda\rangle = \lambda|x_\lambda\rangle\}$ in a system with wavefunction $|\psi\rangle$ will yield the value $\lambda$ with probability $|\langle x_\lambda\mid\psi\rangle|^2$. This rule is a fundamental postulate of quantum mechanics, and there's no broad consensus on why it should be the case; we know that it can be derived from other assumptions, but that only defers the question. (There are also many attempts to derive it directly from the assumptions of each interpretation; see e.g. notable physicist-turned-controversieur Robin Hanson's "mangled worlds" argument that Born rule-type probabilities show up naturally in MWI).
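
A minimal numerical illustration of the rule (my own, with an arbitrarily chosen state): for a single qubit measured in the $\{|0\rangle, |1\rangle\}$ basis, the outcome probabilities are just the squared magnitudes of the amplitudes.

```python
import numpy as np

# An arbitrary normalized qubit state a|0> + b|1>.
psi = np.array([1 + 1j, 1.0])
psi = psi / np.linalg.norm(psi)

# Born rule: P(outcome k) = |<k|psi>|^2 for the basis states |0>, |1>.
probs = np.abs(psi) ** 2
print(probs)          # [0.666..., 0.333...]
print(probs.sum())    # 1.0 (up to float error): amplitudes square to a distribution
```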

To understand this question—what notion of probability obtains in actual reality?—would obviously be an advantage: while such real probabilities operate independently of epistemic probabilities, they offer the objective standard for logical coherence, showing us which mathematical and conceptual tools are necessary to build a (literally) realistic model of probability. (My epistemic probability distribution over the value of the $10^{10^{100}}$th digit of $\pi$ is uniform, and justifiably so—it has never been computed, and $\pi$ has no known biases among its digits (at least, that I know of), so there's no reason for me to think it might be more likely to be any one digit than another. But there is a single correct value.) Beyond this, the nature of quantum nondeterminism has significant implications for anthropic and ethical reasoning.

To be honest, though, I do not find these arguments for a quantum understanding of probability strong enough to decide that working on this is worth the time it would take. If there are good reasons, I'd like to see them—or, if anyone else wants to work on this, I would share my writings and discuss lines of thought with them.

Between Interpretations

Let's see how the different interpretations contrast with one another. First, though, it's worth noting the ways in which they can work together.

Dual-Wielding Interpretations

While subjective vs objective Bayesianism is a genuine disagreement—which priors must we use?—there is a sense in which subjective vs objective interpretations of probability are entirely different things, so that Bayesians and frequentists are in fact talking past each other. Because of this, we can simultaneously equip ourselves with tools for dealing with subjective credences and objective frequencies. This only goes so far, though, owing to the fact that subjective probabilists do eventually need to couple their beliefs to the real world, and therefore come into contact with frequentists.

Still, though, a subjectivist has some wiggle room in interpreting how their beliefs ought to couple to reality, and vice versa: David Lewis, most famous for his work on modal realism and the semantics of counterfactuals, attacked this question in A Subjectivist's Guide to Objective Chance; the most enduring influence of this paper is the Principal Principle, which relates credences to 'chances' (Lewis's term for objective probabilities, viewed from a subjective standpoint): provided one has no "inadmissible evidence" about the chance of some outcome (see the paper for details), one should always set one's credence equal to one's estimated chance. Eventually, Lewis came to consider this principle wrong, replacing it with the so-called 'New Principle'; Strevens's paper A Closer Look at the 'New' Principle examines the history of this change and scrutinizes the new principle.

Yudkowsky has a very useful idiom in this regard (I don't know if it originated with him, or where I might locate a source): you should only assign a 5% probability to things when you really think that you'd only be wrong one out of every twenty times you assign such a probability; you should only say that something has above a 99.8% chance of happening when you really think that you could make over five hundred such statements and, on average, be wrong just once. Across many events, your credences ought to match the actual limiting frequencies of events. Insofar as they do, you are said to be well-calibrated—while there's a lot of interesting psychology to be discussed here, all I'll point out for now is what calibration looks like, via Scott Alexander's Grading My 2021 Predictions, and how calibration might be trained, via Andrew Critch's page on CFAR's Credence Calibration Game.

(Leader of "Bayesian Conspiracy" Exposed As Frequentist!! Let's be clear about how this method of coupling credences to frequencies is different from frequentism: you can give a 30% credence to the population of China being above 1.5 billion, but that's not coherent as a frequency: there's no way to repeat your measurement. Looking up the number again will just get you the same number, and China isn't exchangeable with another random country. But the act of making a 30% credence is itself a repeatable experiment, and therefore does have a frequency; calibration is when the subjective probability converges to the objective probability. This doesn't seem sufficient to say that the subjective probabilities are objective: if some mental flaw causes you to say you're 30% confident in facts about China that only turn out to be true 20% of the time, with other 30% credences of yours coming true more often so as to balance it out, someone who notices this pattern could systematically be correct more often than you. I'm not too sure what to make of this.)
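
For a picture of what checking calibration actually involves, here is a small Python sketch (my own, with made-up predictions): bucket your stated credences and compare each bucket's average credence to the empirical frequency with which those predictions came true.

```python
from collections import defaultdict

# (stated credence, whether the prediction came true) -- made-up records.
predictions = [(0.9, True), (0.9, True), (0.9, False), (0.7, True),
               (0.7, False), (0.5, True), (0.5, False), (0.3, False),
               (0.3, False), (0.3, True), (0.1, False), (0.1, False)]

buckets = defaultdict(list)
for credence, outcome in predictions:
    buckets[round(credence, 1)].append(outcome)

# Well-calibrated: within each bucket, the stated credence roughly matches
# the fraction of predictions that came true.
for credence in sorted(buckets):
    outcomes = buckets[credence]
    print(f"said {credence:.0%}: came true {sum(outcomes)/len(outcomes):.0%} "
          f"of {len(outcomes)} times")
```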

Subjective vs Objective Probability

The Bayesianism-frequentism argument is too played out. My heart's just not into it anymore. For a more refreshing perspective, see jsteinhardt's Beyond Bayesians and Frequentists, which discusses where Bayesian techniques happen to outperform frequentist ones and where they don't, and gives criteria for deciding between the two as well as possible middle grounds. To quote the concluding section:

When the assumptions of Bayes' Theorem hold, and when Bayesian updating can be performed computationally efficiently, then it is indeed tautological that Bayes is the optimal approach. Even when some of these assumptions fail, Bayes can still be a fruitful approach. However, by working under weaker (sometimes even adversarial) assumptions, frequentist approaches can perform well in very complicated domains even with fairly simple models; this is because, with fewer assumptions being made at the outset, less work has to be done to ensure that those assumptions are met.

The Scientific Structure of Probabilities

What probabilities are depends in part on where they come from. For a frequentist, there's not much here—probabilities are mere epiphenomena arising from the distribution of outcomes yielded by repetitions of an experiment. For a subjectivist, we have to ask: where do credences come from?

Andrew Gelman makes the point in the title of the very brief A probability isn't just a number; it's part of a network of conditional statements. This poses a clear issue for models that end up deriving false predictions—how do we track the error down in the large network of relations that scientists often derive their priors from? The standard name for this is the "Duhem problem", after the Duhem-Quine thesis; Deborah Mayo's Duhem's Problem, the Bayesian Way, and Error Statistics, or "What's Belief Got to Do with It?" speaks of general attempts by statisticians to tackle it, and in particular of the problems that this poses for Bayesian inference.

A similar problem is the sensitivity of Bayesian inference to slight variations in the underlying models: Jaynes's The A_p Distribution and Rule of Succession describes the way in which scientists might obtain probabilities from underlying models, and points out how two propositions given the same probability can differ drastically in stability, or the extent to which they change conditional on new evidence. Sometimes this is a good thing, as illustrated by one of his examples—if we solidify hydrogen in the lab at such and such a temperature and pressure, our probability that it'll do so again under the same conditions should not be 2/3, as indicated by a naive use of Laplace's rule of succession. It should be arbitrarily close to 1, because we chunk that knowledge as "hydrogen is a thing that solidifies in these conditions"—as a universal property. Subjective probabilities are outputs of entire causal mechanisms, and must change accordingly.
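
For reference, the rule in question (a standard derivation from a uniform prior over the underlying chance, not Jaynes's own framing): after observing $s$ successes in $n$ trials,

$$P(\text{success on trial } n+1 \mid s \text{ successes in } n \text{ trials}) = \frac{s+1}{n+2},$$

so a single successful solidification gives $\frac{1+1}{1+2} = \frac{2}{3}$, the number Jaynes objects to taking as our actual credence.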

Owhadi, Scovel, and Sullivan's On the Brittleness of Bayesian Inference points out a problem with this: models might make the ways in which we update on new data "rigid", such that—to quote the abstract—"(1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions. The mechanism causing brittleness/robustness suggests that learning and robustness are antagonistic requirements and raises the question of a missing stability condition for using Bayesian Inference in a continuous world under finite information".

Frequentism doesn't get away cleanly, though. Steegen et al.'s Increasing Transparency Through a Multiverse Analysis makes the point that, in actual analysis of data, the probabilistic models you actually choose and the statistical tests you actually run do depend on the specific outcomes you got (e.g. when you do exploratory data analysis, or really whenever you take data processing steps with degrees of freedom, such as categorizing data, combining variables, transforming variables, excluding some data, and so on), and that this does create bias. They propose "multiverse analyses", where you consider the range of possible results you could've gotten had you done things differently, scanning this range to see where the most important choices lie and what their presence suggests about your conclusion. (I have a parallel thesis about objective Bayesianism which I'd like to write more about at some point: it's not an objective prior unless you precommit to using it no matter how poorly it ends up fitting, no matter how ugly the math gets (as happens w/ non-conjugate priors). If you say "we're going to use the Jeffreys prior because it's non-informational", but you are such that you would, upon seeing some particular data, decide to abandon this choice, then your use of the Jeffreys prior is dependent on the data being a certain way, and therefore informational even if you do in fact end up going through with it!) Related writings on this are Andrew Gelman and Eric Loken's paper The Garden of Forking Paths, Gelman's "If you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis", and Greenland et al.'s To Aid Scientific Inference, Emphasize Unconditional Compatibility Descriptions of Statistics.
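
As a toy version of the idea (entirely synthetic data and arbitrary processing choices, just to show the shape of a multiverse analysis, not Steegen et al.'s procedure): enumerate the defensible data-processing choices, run the same test under every combination, and look at how much the conclusion moves.

```python
import random
from itertools import product
from statistics import mean, stdev

random.seed(0)
# Synthetic "measurements" for two groups with a small planted difference;
# in a real multiverse analysis this would be your actual dataset.
group_a = [random.gauss(0.0, 1.0) for _ in range(50)]
group_b = [random.gauss(0.2, 1.0) for _ in range(50)]

def identity(x):
    return x

def t_stat(a, b):
    """Plain two-sample t statistic (equal-variance flavor, for illustration)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a)**2 + (nb - 1) * stdev(b)**2) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# The "multiverse": every combination of defensible processing choices.
outlier_cutoffs = [None, 2.5, 2.0]     # drop |x| above the cutoff, or don't
transforms = [identity, abs]           # analyze raw values, or a transform of them

for cutoff, f in product(outlier_cutoffs, transforms):
    a = [f(x) for x in group_a if cutoff is None or abs(x) <= cutoff]
    b = [f(x) for x in group_b if cutoff is None or abs(x) <= cutoff]
    print(f"cutoff={cutoff}, transform={f.__name__}: t = {t_stat(a, b):+.2f}")
```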

The Use of Probabilities

Reasoning About Your World

A. Probabilistic Reasoning vs. Agency

We use probability to reason about things we wish to influence in the future, as well as about things that have influenced us in the past. The general solution to this might seem simple enough—plug your data and your uncertainty (prior) into your model, apply Bayes, and where applicable take the action that maximizes expected utility w.r.t. your new posterior. But the elegance of this solution is only made possible by a fallacious agent-environment distinction.

Scott Garrabrant makes the point in Bayesian Probability is for things that are Space-like Separated from You. The term comes from special relativity, where space-like separation refers to things outside your lightcone. (Things you can reach without exceeding the speed of light are said to be time-like separated, and everything else is space-like separated. My mnemonics: you're time-like separated from me when it's only a matter of time before I reach you, and space-like separated when there's too much space between us.) What's inside of your lightcone is your past and your future—the things that you can observe, and the things that you can affect. In the metaphor of logical time Garrabrant employs, the past consists of possible (resp. actual) events that your actions anticipate (resp. react to), and the future consists of actions that anticipate (resp. react to) your possible (resp. actual) actions.

The first point is in fact an incredible problem for Bayesian updating in adversarial environments, discussed more thoroughly in Abram Demski's Thinking About Filtered Evidence is (Very!) Hard. (A note that I have nowhere else to put: Demski's line of thought comes from an interesting line of research on the problem of "logical induction", which attempts to tackle a tricky question: how can we say that a rational agent that knows ZFC should assign a 10% credence to the $10^{10^{100}}$th digit of $\pi$ being zero, when the value of this digit of $\pi$ should in principle follow directly from ZFC? In general, the problem concerns the apportionment of credences to logical statements in a way that updates coherently upon learning of new proofs—see e.g. Garrabrant, Fallenstein, Demski, and Soares in Inductive Coherence for some generalities on the problem of assigning probabilities to mathematical statements that you could in principle just directly prove or disprove.) The general problem, one shared by humans, is a sort of "Cartesian" use of Bayesianism—perhaps best known as it shows up in AIXI—which relies on a world-model which you impassively observe and act upon as an independent mind; the solution, theoretically, is to have a model that takes fully into account the way in which you are integrated into the world.

Yet this is dangerous for an entirely different reason. While Demski's Embedded Agency is beyond our scope, his The Parable of Predict-O-Matic hints at why we might not want to build any agent whose probabilistic model lets it predict how its actions affect its observations. Insofar as that agent has a utility which depends on its observations, such as making predictions which minimize error against those observations, it has not just the motive but the ability to figure out how to act in order to increase the utility of what it observes. Even if you just want it to predict things for you—stating predictions is an action, and it can benefit by tailoring or conditionally sharing its predictions in order to manipulate you.

Of course, there are good reasons to think that this is a capacity the agent should have: you want your construct to be able to do the optimal thing to achieve its goal, and then control what it decides to do in order to keep it from subverting or destroying you. If it can't solve this problem, known generally as the naturalized induction problem, it will always be artificially stupid (unless it somehow manages to develop this notion on its own, in the same way that humans sometimes do). There are some proposed solutions to this problem, such as infra-Bayesian physicalism (if there are good arguments that infra-Bayesianism says worthwhile things about our fundamental understanding of probabilistic thinking, as opposed to just shaping how we think about decision, learning, etc., I'd be strongly interested in seeing them), but, without consensus on any one solution, it continues to stay an open problem.

These are some extraordinarily general limitations to the use of Bayesian probabilities. You can say that this or that coherence theorem is why Bayesian agents are the most accurate—but when it turns out that they just aren't, when they're intractably screwed over by such simple twists that humans readily reason about, what do all your proofs, all your theorems, all your mathematical guarantees come to?

In any case, there's another sense in which you might use probability to reason about "your world", one in which the nature of your existence in it is even more fundamental. Anthropics.

B. The Probability of Indexicals

As painful as it is, anthropic reasoning must inform the way we use probabilistic reasoning. When we say that no, the fine-tuning of the universe to support life doesn't imply a fine-tuner, it's a necessary condition of our being here to observe anything at all—we're using anthropics, arguing that even though $P(\text{life}\mid \text{random tuning})$ might be very low, we're not looking down at an arbitrary randomly chosen universe; we're within a universe which must be able to support us. (Is it the case that the fact of our existence implies that fine-tuning is in fact not the case—that a significant proportion of the "possible universes" that arise from varying whatever free parameters exist in the mathematical framework of the true Theory of Everything end up containing life? Or does it imply that [???] takes all options at once, whether because of some super-superposition over the universe space or because of a Tegmark IV-type ontology?) When we try to answer Sleeping Beauty-like problems, we can use Bayes' theorem, but it's not clear whether and how you need to condition on the fact of your own existence. The canonical reference is Bostrom's Anthropic Bias, which discusses such problems in depth—but, for a quick overview of the more terrifying thought experiments, see Scott Aaronson's Fun With the Anthropic Principle (notes from lecture 17 of his quantum computing course).

Either way, radical Bayesians have a problem—if they ignore anthropics, they're fucked because they're incoherently settling a large class of important questions by ignorant dogmatism, and if they don't ignore anthropics, they're fucked because now they have to think about anthropics.

Stuart Armstrong's Anthropics: different probabilities, different questions shows how different theories of anthropic probability are really answers to different questions, and ata's If a tree falls on Sleeping Beauty... discusses how this plays out in the case of the Sleeping Beauty problem—how those who answer 1/3 and those who answer 1/2 are answering different questions.
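
The "different questions" point is easy to see in simulation. The sketch below (my own toy version of the standard setup) runs the experiment many times (heads means one awakening, tails means two) and computes two different frequencies: the fraction of experiments in which the coin landed heads, and the fraction of awakenings at which the coin landed heads.

```python
import random

random.seed(0)
trials = 100_000

experiments_heads = 0   # per-experiment count: "what fraction of runs were heads?"
awakenings = 0          # per-awakening counts: "at what fraction of wakings is it heads?"
awakenings_heads = 0

for _ in range(trials):
    heads = random.random() < 0.5
    n_wakings = 1 if heads else 2   # heads: wake Monday only; tails: Monday and Tuesday
    experiments_heads += heads
    awakenings += n_wakings
    awakenings_heads += 1 if heads else 0

print(experiments_heads / trials)       # ~0.5   (the halfer's question)
print(awakenings_heads / awakenings)    # ~0.333 (the thirder's question)
```

Both numbers are correct answers; they just answer different questions about the same process.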

Reasoning About Data

We want to use probability to reason about what things are true. The theory through which frequentists come to conclusions about what is true looks more like statistics proper, which I won't go into—though, again, see Statistics is Difficult, which points to the unusually interesting divide between Fisher and Neyman-Pearson on the extent to which statistical inference ought to help guide our interpretations of evidence (as opposed to directly guiding our decisions).

For Bayesians, the situation is thornier. Aumann's Agreement Theorem tells us that Bayesian agents with a common prior whose posteriors are common knowledge must converge on the same beliefs, rather than agreeing to disagree—but this never happens. We are not rational enough, quantitatively precise enough, or computationally powerful enough to implement it. In this sense, Robin Hanson calls us "Bayesian wannabes", and, in his paper For Bayesian Wannabes, Are Disagreements Not About Information?, demonstrates that such wannabes with the same starting priors will disagree for reasons beyond their simply having different information about the world.

Andrew Gelman, who literally wrote the book on Bayesian Data Analysis, gives two more pragmatic objections to the use of Bayesian inference in his Objections to Bayesian Statistics. First, that Bayesian methods are generally presented as "automatic", when that's just not how statistical modeling tends to work—it's a setting-dependent thing, where models are used highly contextually—and, second, that it's not clear how to assess the kind of subjective knowledge Bayesian probability claims to be about, and science should be concerned with objective knowledge anyway. There's a great paragraph that I'll quote:

As Brad Efron wrote in 1986, Bayesian theory requires a great deal of thought about the given situation to apply sensibly, and recommending that scientists use Bayes’ theorem is like giving the neighborhood kids the key to your F-16. I’d rather start with tried and true methods, and then generalize using something I can trust, such as statistical theory and minimax principles, that don’t depend on your subjective beliefs. Especially when the priors I see in practice are typically just convenient conjugate forms. What a coincidence that, of all the infinite variety of priors that could be chosen, it always seems to be the normal, gamma, beta, etc., that turn out to be the right choices?

Making Good Decisions

Abram Demski's Complete Class: Consequentialist Foundations offers a "more purely consequentialist foundation for decision theory", and a "proposed foundational argument for Bayesianism". He argues that Dutch books serve more to illustrate inconsistencies than to establish desiderata for rationality, because beliefs are distinct from actions like betting, and a decision theory is required to link them. Instead, Demski advocates the titular complete class theorem, which states that any decision rule that's Pareto-optimal (or admissible: dominated by no strictly superior decision rule) can be expressed as Bayesian—namely, as an expected utility maximizer for some prior. As Jessica Taylor's comment points out, though, the "some" in that sentence is doing a lot of work: there are ways for groups to arrive at Pareto-optimal outcomes without utilizing Bayesian methods, which only happen to be rationalizable in those terms. The prior guaranteed by the theorem might end up looking absolutely insane.

As regards the role of Bayesian inference in decision theories, see also Wei Dai's Why (and why not) Bayesian Updating?, and the paper it discusses, Paolo Ghirardato's Revisiting Savage in a conditional world, which gives a list of seven axioms that, taken together, are "necessary and sufficient for an agent's preferences in a dynamic decision problem to be represented as expected utility maximization with Bayesian belief updating" (quoting the former).

Additional Topics

There are lots of things that I haven't covered here which would be essential to a fuller treatment of the subject. If you want to truly understand how probability is difficult, some other subject-clusters to explore include:

If you've made it to the end, then I hope you got something out of this brief outline. The single message I want to underscore is, again, that probability is difficult. It's important to learn where and when it breaks down, so that you don't make great mistakes in placing so much weight upon such weak foundations—or, worse, build an agent that systematically does so without the ability to ever tell that it's "not supposed" to be doing something pathological. If there's anything I overlooked, any errors of fact or attribution, any subject-clusters or resources I should strongly consider including, please let me know.
