(Published June 9, 2023 by Mariven)
The idea of probability is one we commonly rely on in order to talk about our world models, especially in communities where their interpretation as credences, or subjective degrees of belief, is widespread. More relevant to artificial intelligence is the manner in which we implicitly bake probability into our theories of decision and utility, justifying it post-hoc with e.g. the VNM theorem or complete class theorem. But the axioms consistently fail to apply; no matter how hard we grasp, reality always seems to slip away from our models of it. Here, I compile a lot of resources that demonstrate why probability is difficult. Some of the questions I want to point to, and list resources for figuring out answers to, include:
A year and a half ago, I wrote Probability is Difficult, a review of the foundations of probability theory: the interpretations of various intuitive notions of probability, the axiomatic systems through which we use them in mathematical applications, and the various pain points and paradoxes that we have to watch out for if we want to be not just consistent but correct in our use of probabilistic reasoning. The core notion I wanted to impart is in the title: probability is difficult. It's not so simple as setting up numbers and playing with them—you have to couple those numbers consistently to reality, and this is so incredibly hard to do correctly. While little more than a thorough compilation of existing work, writing it was very educational, and I figured it would be worthwhile to compose a LessWrong version tailored to this site's idioms and intents.
In the course of cleaning it up, I found that I could do something much better: make it a guide on the proper and improper use of probability in general. Not just from inside the mathematical perspective—what kinds of mathematics don't break down due to their own logic—but from outside it, the place where you're deciding what mathematics to use and how to use it. Not just how to interpret the use of probability, but how to employ it: our theories of decision, utility, and learning fundamentally depend on probabilistic reasoning, both at the object-level (as when we speak of maximizing expected utility) and at the meta-level (as when we argue about the Solomonoff prior), and this has motivations and consequences; we ought to be aware of when and why it breaks down, which it does surprisingly often.
Unfortunately, I realized after starting to work on this updated version that to sketch this full picture to any adequate level of detail would take far more time than I'm willing to spend. So, all I can do for now is try to outline how someone who wanted to understand exactly why, how, and where probability is difficult would go about doing so, by listing various resources, key terms and areas of study, and lines of thought being pursued. I am trying to list resources that, whether by saying worthwhile things or pointing to other worthwhile places, are useful for structuring one's understanding of a given topic.
The main resources I've compiled are given as fifty-something bolded links woven throughout the exposition. (Were it just a plain list, people would go "wow! cool!", bookmark it, and never read it again; I want to give you some idea of the underlying narrative which makes them important to understanding why probability is difficult, so as to actually motivate them.)
I'll assume that you already have some knowledge of probability theory, and know the basics about frequentism, the Kolmogorov axioms, Dutch books and Bayesian inference, and so on. If you don't, then, again, the original version of Probability is Difficult is a great place to start.
There are many ways to interpret the idea of 'a probability', the most common of which are:
These are respectively known as the classical, propensity, (finite vs hypothetical) frequency, and (subjective vs objective) Bayesian interpretations. Note that the terms 'subjective' and 'objective' are used in two different ways here: the frequentist and propensity interpretations are objective where the Bayesian is subjective in that they attribute probabilities as being internal to the object, i.e. the world, rather than the subject, the one reasoning about the world. Bayesianism's subjective-vs-objective split is about the extent to which the reasoner is forced to construct their internal probabilities in a single correct way based on their prior knowledge. (The update from prior to posterior given some data is always the same, though).
A very quick and marginally illustrated overview of the main interpretations of probability is given by Nate Soares' Correspondence visualizations for different interpretations of "probability", which fits in a trilogy of tinyposts about probability interpretation.
A much larger, more thorough discussion is given by the Stanford Encyclopedia of Philosophy's page on interpretations of probability. As expected, though, it's largely embedded in the explicitly philosophical literature, filled with citations so as to mention every philosopher's opinion no matter how ill-informed or irrelevant. Still, though, it's a pretty good place to get your bearings. An even better discussion is given in the excellent review Probability is Difficult. (The author remains pseudonymous, but the stunning combination of philosophical deftness and technical expertise speaks volumes as to the author's erudition, lucidity, and, above all, humility.)
If one wants a textbook-length explanation from a philosophical standpoint, there's Donald Gillies' Philosophical Theories of Probability, which also attempts to answer the question of when and where we should use this or that interpretation—see e.g. Ch. 9's "General arguments for interpreting probabilities in economics as epistemological rather than objective".
There are other ways to slice up the subject beyond these four interpretations: see e.g. the Wikipedia page on aleatoric and epistemic uncertainty; these correspond roughly to frequentist and Bayesian notions, but take a slightly different point of view that keeps the correspondence from being a clean one; just the same, there are ways to slice up the subject that fall neatly within these interpretations and should not be confused with 'probability'. Likelihood is one such notion, its differentiation vs probability being explained by Abram Demski's Probability vs Likelihood, as well as the Wikipedia page on likelihood functions.
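To make the probability-versus-likelihood distinction concrete, here is a minimal sketch using a binomial model of my own choosing (the function names are hypothetical, not taken from either linked resource). Holding the parameter fixed and summing over outcomes gives 1, as a probability distribution must; holding the data fixed and integrating over the parameter does not.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """P(k successes in n trials | success chance theta)."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def total_over_outcomes(n, theta):
    # Fix the parameter, vary the data: a genuine probability distribution.
    return sum(binomial_pmf(k, n, theta) for k in range(n + 1))


def total_over_parameters(k, n, steps=10_000):
    # Fix the data, vary the parameter: this is the likelihood function.
    # Midpoint-rule integral over theta in [0, 1].
    width = 1.0 / steps
    return sum(binomial_pmf(k, n, (i + 0.5) * width) for i in range(steps)) * width


if __name__ == "__main__":
    print(total_over_outcomes(10, 0.3))    # sums to 1 (up to rounding)
    print(total_over_parameters(7, 10))    # integrates to 1/(n+1), not 1
```

For the binomial case the integral over the parameter works out to 1/(n+1) regardless of the observed count, which is one quick way to see that a likelihood function is not a distribution over parameters until a prior is supplied.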
If you want to know how to think like a Bayesian, look no further than the bibliography of E. T. Jaynes. The father of both modern Bayesianism and the acerbic overconfidence of its proponents, his textbook Probability Theory: The Logic of Science builds the entire subject of probability theory as a connected edifice of techniques, heuristics, and formulas for accurately modifying and applying your beliefs about the real world (link is to a pdf; see this LW post of the same name for a much shorter review/walkthrough).
Two more conceptually oriented texts on subjective reasoning by notorious Bayesians are Bruno de Finetti's Theory of Probability and I. J. Good's Good Thinking: The Foundations of Probability and Its Applications; these look more carefully at where probability and its logic fundamentally come from, and Good in particular takes care to demarcate questions of probability from those of statistics, utility, and decision. Chapter 3 of Good's book is a (two-page) article entitled 46,656 Varieties of Bayesians, which (as you might have instinctively guessed from the number) is a combinatorial division of kinds of Bayesianism based on their answers to several different questions.
Jaynes spent most of his career dunking on frequentists, and occasionally this produced useful observations on the Bayesian philosophy and approach.
				
Bayesianism's biggest, most notable problem—whence priors?—is also the source of its largest split, between objective Bayesians and subjective Bayesians. Again, note the distinction between objective Bayesianism and objective probability interpretations such as frequentism. Jaynes, an objective Bayesian, describes what the term means in his paper Probability in Quantum Theory:
Our probabilities and the entropies based on them are indeed "subjective" in the sense that they represent human information; if they did not, they could not serve their purpose. But they are completely "objective" in the sense that they are determined by the information specified, independently of anybody's personality, opinions, or hopes. It is "objectivity" in this sense that we need if information is ever to be a sound basis for new theoretical developments in science.
And again in his Prior Probabilities and Transformation Groups:
A prior probability assignment not based on frequencies is necessarily "subjective" in the sense that it describes a state of knowledge, rather than anything which could be measured directly in an experiment. But if our methods are to have any relevance to science, the prior distribution must be completely "objective" in the sense that it is independent of the personality of the user; i.e., it should describe the prior information, and not anybody's personal feelings.
The original objective Bayesians applied the principle of indifference to get priors: when you want to estimate some parameter in some range, start with a prior that is constant over the parameter space, giving equal weight to all options. This was famously called into question by Bertrand's paradox, which showed that slightly different ways of constructing the exact same parameter space can give different uniform priors, as well as by the similar wine/water paradox, which makes the same point in an even more insoluble manner. Jaynes's paper The Well-Posed Problem attempts to save the principle of indifference by showing that there is in fact a single unique way to be indifferent to the parameter in Bertrand's paradox; he nevertheless admits that his transformation group approach cannot solve the wine/water paradox. (And his solution for the former is problematic as well: as explained by Alon Drory's Failure and Uses of Jaynes' Principle of Transformation Groups, the method by which Jaynes supposedly found a single canonical solution can be adjusted to find each of the other two solutions as well).
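Bertrand's paradox is easy to reproduce numerically. The sketch below assumes the standard formulation (what is the chance that a "random" chord of a unit circle is longer than the side of the inscribed equilateral triangle?) and three common ways of drawing the chord; the function names are mine, for illustration only.

```python
import math
import random

SIDE = math.sqrt(3.0)  # side of the equilateral triangle inscribed in a unit circle


def chord_random_endpoints(rng):
    # Pick two independent uniform points on the circumference.
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    return 2.0 * abs(math.sin((a - b) / 2.0))


def chord_random_radius(rng):
    # Pick a uniform distance from the centre along a radius;
    # the chord is perpendicular to the radius at that point.
    d = rng.uniform(0.0, 1.0)
    return 2.0 * math.sqrt(max(0.0, 1.0 - d * d))


def chord_random_midpoint(rng):
    # Pick the chord's midpoint uniformly over the area of the disk.
    r = math.sqrt(rng.uniform(0.0, 1.0))
    return 2.0 * math.sqrt(max(0.0, 1.0 - r * r))


def estimate(method, n=200_000, seed=0):
    """Monte Carlo estimate of P(chord longer than triangle side)."""
    rng = random.Random(seed)
    return sum(method(rng) > SIDE for _ in range(n)) / n


if __name__ == "__main__":
    for method in (chord_random_endpoints, chord_random_radius, chord_random_midpoint):
        print(method.__name__, round(estimate(method), 3))
```

The three estimates land near 1/3, 1/2, and 1/4 respectively: each sampler is "uniform" over some parametrization of the same set of chords, which is exactly the problem for a naive principle of indifference.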
The modern, high-tech version of the principle of indifference is the Jeffreys prior, which is proportional to the square root of the determinant of the Fisher information matrix; it is invariant under reparametrization, and thereby manages to avoid bias in the wine/water paradox. In practice, though, Jeffreys priors tend to be non-normalizable, as in most of the examples on the Wikipedia page. (This doesn't dissuade everyone, though: see Andy Jones's Improper Priors for a tutorial on how they can be used to derive proper posteriors.)
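As a small check of the invariance claim, here is a sketch for a single Bernoulli parameter, comparing the success-probability parametrization with the log-odds one. The model and the function names are my own choices for illustration. (In this particular case the Jeffreys prior happens to be normalizable, so this example does not show the improper behaviour mentioned above.)

```python
import math


def sigmoid(phi):
    return 1.0 / (1.0 + math.exp(-phi))


def fisher_information_theta(theta):
    # Expected squared score for x in {0, 1} under p(x | theta).
    score_if_one = 1.0 / theta
    score_if_zero = -1.0 / (1.0 - theta)
    return theta * score_if_one**2 + (1.0 - theta) * score_if_zero**2


def fisher_information_phi(phi):
    # Same model, parametrized by log-odds phi = log(theta / (1 - theta)).
    theta = sigmoid(phi)
    score_if_one = 1.0 - theta   # d/dphi log sigmoid(phi)
    score_if_zero = -theta       # d/dphi log (1 - sigmoid(phi))
    return theta * score_if_one**2 + (1.0 - theta) * score_if_zero**2


def jeffreys_phi(phi):
    # Unnormalized Jeffreys density computed directly in phi.
    return math.sqrt(fisher_information_phi(phi))


def transported_jeffreys(phi):
    # Jeffreys density computed in theta, then pushed to phi with the Jacobian.
    theta = sigmoid(phi)
    jacobian = theta * (1.0 - theta)  # d theta / d phi
    return math.sqrt(fisher_information_theta(theta)) * jacobian


def transported_flat(phi):
    # A flat prior in theta, pushed to phi: no longer flat.
    theta = sigmoid(phi)
    return 1.0 * theta * (1.0 - theta)


if __name__ == "__main__":
    for phi in (-2.0, 0.0, 1.5):
        print(phi, jeffreys_phi(phi), transported_jeffreys(phi), transported_flat(phi))
```

The two Jeffreys columns agree at every point, whereas the flat prior in one parametrization is visibly non-flat in the other, which is the defect the Jeffreys construction is meant to avoid.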
Another attempt by Jaynes to construct objective priors for Bayesian analysis is the principle of maximum entropy, or MaxEnt. His paper Notes on Present Status and Future Prospects talks about the role played by the Maximum Entropy principle in objective Bayesianism, and, more generally, discusses why it's important to reason about states of knowledge rather than the world directly. Most urgently, see his Where do we go from here? for an account of a rap battle concerning the objectivity of MaxEnt.
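For a feel of what MaxEnt does in practice, here is a minimal sketch under assumptions of my own: a six-sided die about which the only information is its long-run mean. Under a mean constraint the maximum-entropy distribution has the exponential form p_i ∝ exp(λ·i), so the code just solves for λ by bisection. The function name and the bisection bounds are hypothetical choices, not anything from Jaynes's papers.

```python
import math


def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6)):
    """Maximum-entropy distribution over `faces` with the given mean."""

    def distribution(lam):
        weights = [math.exp(lam * f) for f in faces]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(f * p for f, p in zip(faces, distribution(lam)))

    # The mean is increasing in lam, so plain bisection suffices.
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return distribution((lo + hi) / 2.0)


if __name__ == "__main__":
    print([round(p, 4) for p in maxent_die(3.5)])  # no extra information: uniform
    print([round(p, 4) for p in maxent_die(4.5)])  # skewed toward high faces
```

With a mean of 3.5 the constraint carries no information and the result is uniform, recovering the principle of indifference as a special case; any other mean tilts the distribution as little as the constraint allows.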
For a much larger exposition on the problems with objective Bayesianism, see Elliott Sober's Bayesianism: Its Scope and Limits.
If you want to see how frequentists think, most standard statistics textbooks should do. Owing to the dominant role played by frequentists in the foundations of statistical inference, and the fact that the foundations of probability are usually just taught as a sidenote in most statistics classes, people hardly learn to work with probability except in the frequentist context where you have some experiment, some outcome, and you want to pick some test statistic (function that computes the 'extremeness' of some aspect of the data) in order to establish some sort of confidence interval or reject some null hypothesis with some low p-value (probability that the null hypothesis would give data with a test statistic at least as extreme as the given one).
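The definition of a p-value given above can be sketched in a few lines. The example assumes a hypothetical coin-flipping experiment with "number of heads" as the test statistic and a one-sided notion of extremeness; both are illustrative choices, not a recommendation.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """P(at least `heads` heads in `flips` flips | null hypothesis).

    One-sided: 'at least as extreme' here means 'at least as many heads'.
    """
    return sum(
        comb(flips, k) * p_null**k * (1.0 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


if __name__ == "__main__":
    # Nine heads in ten flips of a supposedly fair coin.
    print(binomial_p_value(9, 10))  # 11/1024, roughly 0.0107
```

Note what the number is and is not: it is a probability of data given the null hypothesis, not a probability that the null hypothesis is true.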
				
Foundational issues aren't really discussed all that much by frequentists; the biggest split among them is not between finite and hypothetical frequentists, but between Fisherian significance testing and Neyman-Pearsonian hypothesis testing (for which see my sequel Statistics is Difficult). In fact, Alan Hájek, in his Fifteen Arguments Against Finite Frequentism, calls finite frequentism "about as close to being refuted as a serious philosophical position ever gets".
				
It seems to me that the propensity interpretation is primarily discussed by philosophers, so there can't be much of use there. Again, my original article goes over what little there is to discuss. However, I'll take this opportunity to discuss an objective notion of probability that behaves more or less like propensity: quantum-mechanical probability.
I haven't actually seen much discussion on the role that quantum mechanics should play in our understanding of probability; most discussion goes the other way around. When we measure the spin of an electron in a Stern-Gerlach experiment, the outcome seems to be random, with each spin (up or down) being equally likely for each electron. The canonical quantum formalism tells us that this does not come from a real-valued probability distribution over the space of possibilities, but from a complex wavefunction over this space.
				
This wavefunction is, loosely speaking, the "square root" of a probability distribution, in a sense made precise by the Born rule.
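For a finite set of outcomes, the Born rule amounts to taking the squared magnitude of each complex amplitude. A minimal sketch, with a hypothetical two-outcome spin state as input (the function name is mine):

```python
import math


def born_probabilities(amplitudes):
    """Map complex amplitudes to outcome probabilities via |amplitude|^2."""
    weights = [abs(a) ** 2 for a in amplitudes]
    total = sum(weights)  # renormalize in case the state is not unit-length
    return [w / total for w in weights]


if __name__ == "__main__":
    # Equal-magnitude amplitudes for spin up and spin down, differing by a phase.
    state = [1 / math.sqrt(2), 1j / math.sqrt(2)]
    print(born_probabilities(state))  # each outcome gets probability ~0.5
```

The phase of each amplitude drops out of a single measurement like this one, but it matters when amplitudes are added before squaring, which is where the wavefunction picture departs from an ordinary probability distribution.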
				
				To understand this question—what notion of probability obtains in actual reality?—is obviously an advantage: while such real probabilities operate independent of epistemic probabilities
				
To be honest, though, I do not find these arguments for a quantum understanding of probability strong enough to decide that working on this is worth the time it would take. If there are good reasons, I'd like to see them—or, if anyone else wants to work on this, I would share my writings and discuss lines of thought with them.
Let's see how different interpretations contrast one another. First, though, it's worth noting the ways in which they can work together.
While subjective vs objective Bayesianism is a genuine disagreement—which priors must we use?—there is a sense in which subjective vs objective interpretations of probability are entirely different things, so that Bayesians and frequentists are in fact talking past each other. Because of this, we can simultaneously equip ourselves with tools for dealing with subjective credences and objective frequencies. This only goes so far, though, owing to the fact that subjective probabilists do eventually need to couple their beliefs to the real world, and therefore come into contact with frequentists.
Still, though, a subjectivist has some wiggle room in interpreting how their beliefs ought to couple to reality, and vice versa: David Lewis, most famous for his work on modal realism and the semantics of counterfactuals, attacked this question in A Subjectivist's Guide to Objective Chance; the most enduring influence of this paper is the Principal Principle, which relates credences to 'chances' (Lewis's term for objective probabilities, viewed from a subjective standpoint): an agent with no "inadmissible evidence" about the chance of some outcome (see paper for details) should always set their credence equal to their estimated chance. Eventually, Lewis came to consider this principle as wrong, replacing it with the so-called 'New Principle'; Strevens's paper A Closer Look at the 'New' Principle examines the history of this change and scrutinizes this new principle.
Yudkowsky has a very useful idiom in this regard (I don't know if it originated with him, or where I might locate a source): you should only assign a 5% probability to things when you really think that you'd only be wrong one out of every twenty times you assign such a probability; you should only say that something has above a 99.8% chance of happening when you really think that you could make over five hundred such statements and, on average, be wrong just once. Across many events, your credences ought to match the actual limiting frequencies of events.
					Leader of "Bayesian Conspiracy" Exposed As Frequentist!!
						Let's be clear about how this method of coupling credences to frequencies is different from frequentism: you can give a 30% credence to the population of China being above 1.5 billion, but that's not coherent as a frequency: there's no way to repeat your measurement. Looking up the number again will just get you the same number, and China isn't exchangeable with another random country. But the act of making a 30% credence is a repeatable experiment, and therefore does have a frequency; calibration is when the subjective probability converges to the objective probability.
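Calibration in this sense is something you can actually measure. Here is a minimal sketch, assuming you have logged your forecasts as (credence, came_true) pairs; the function name and the choice to bucket credences to the nearest 0.1 are my own illustrative assumptions.

```python
from collections import defaultdict


def calibration_table(forecasts):
    """Map each stated credence (rounded to 0.1) to the observed hit rate.

    `forecasts` is an iterable of (credence, came_true) pairs.
    """
    buckets = defaultdict(list)
    for credence, came_true in forecasts:
        buckets[round(credence, 1)].append(1 if came_true else 0)
    return {c: sum(hits) / len(hits) for c, hits in sorted(buckets.items())}


if __name__ == "__main__":
    # Ten separate one-off claims, each given 30% credence; three came true.
    log = [(0.3, True)] * 3 + [(0.3, False)] * 7
    print(calibration_table(log))  # a calibrated forecaster: 0.3 maps to 0.3
```

No single entry in the log is a repeatable experiment, but the bucket of "things I said 30% about" is, and a well-calibrated forecaster is one for whom each bucket's hit rate sits close to its label.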
						
The Bayesianism-frequentism argument is too played out. My heart's just not into it anymore. For a more refreshing perspective, see jsteinhardt's Beyond Bayesians and Frequentists, which discusses where Bayesian techniques happen to outperform frequentist ones, and where they don't, and gives criteria for deciding between the two as well as possible middle grounds. To quote the concluding section:
When the assumptions of Bayes' Theorem hold, and when Bayesian updating can be performed computationally efficiently, then it is indeed tautological that Bayes is the optimal approach. Even when some of these assumptions fail, Bayes can still be a fruitful approach. However, by working under weaker (sometimes even adversarial) assumptions, frequentist approaches can perform well in very complicated domains even with fairly simple models; this is because, with fewer assumptions being made at the outset, less work has to be done to ensure that those assumptions are met.
What probabilities are depends in part on where they come from. For a frequentist, there's not much here—probabilities are mere epiphenomena that arise from distribution of outcomes yielded by repetitions of an experiment. For a subjectivist, we have to ask: where do credences come from?
Andrew Gelman makes the point in the title of the very brief A probability isn't just a number; it's part of a network of conditional statements. This poses a clear issue for models that end up deriving false predictions: how do we track the error down in the large network of relations that scientists often derive their priors from? The standard name for this is the "Duhem problem", after the Duhem-Quine thesis; Deborah Mayo's Duhem's Problem, the Bayesian Way, and Error Statistics, or "What's Belief Got to Do with It?" speaks of general attempts by statisticians to tackle it, and in particular of the problems that this poses for Bayesian inference.
A similar problem is the sensitivity of Bayesian inference to slight variations in the underlying models: Jaynes's The A_p Distribution and Rule of Succession describes the way in which scientists might obtain probabilities from underlying models, and points out how two propositions given the same probability can differ drastically in stability, or the extent to which they change conditional on new evidence. Sometimes this is a good thing, as illustrated by one of his examples—if we solidify hydrogen in the lab at such and such a temperature and pressure, our probability that it'll do so again under the same conditions should not be 2/3, as indicated by a naive use of Laplace's rule of succession. It should be arbitrarily close to 1, because we chunk that knowledge as "hydrogen is a thing that solidifies in these conditions"—as a universal property. Subjective probabilities are outputs of entire causal mechanisms, and must change accordingly.
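The 2/3 figure, and the point about stability, can be sketched with ordinary Beta priors. To be clear about the assumption: this uses Beta pseudo-counts as a stand-in for the idea, and is not Jaynes's A_p formalism itself; the function names are mine.

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Laplace's rule: P(next trial succeeds) under a uniform prior."""
    return Fraction(successes + 1, trials + 2)


def beta_predictive(prior_a, prior_b, successes, trials):
    """P(next trial succeeds) under a Beta(prior_a, prior_b) prior."""
    return Fraction(prior_a + successes, prior_a + prior_b + trials)


if __name__ == "__main__":
    # One observed solidification out of one attempt.
    print(rule_of_succession(1, 1))           # 2/3

    # Two states of knowledge that both say "1/2" before any data...
    print(beta_predictive(1, 1, 0, 0))        # 1/2
    print(beta_predictive(100, 100, 0, 0))    # 1/2
    # ...but respond very differently to ten successes in a row.
    print(beta_predictive(1, 1, 10, 10))      # 11/12
    print(beta_predictive(100, 100, 10, 10))  # 11/21
```

The two priors assign the same probability to the next trial and yet are very different beliefs: one barely moves under new evidence, the other swings hard. A single number cannot carry that difference.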
Owhadi, Scovel, and Sullivan's On the Brittleness of Bayesian Inference points out a problem with this: models might make the ways in which we update on new data "rigid", such that, to quote the abstract, "(1) two practitioners who use arbitrarily close models and observe the same (possibly arbitrarily large amount of) data may reach opposite conclusions; and (2) any given prior and model can be slightly perturbed to achieve any desired posterior conclusions. The mechanism causing brittleness/robustness suggests that learning and robustness are antagonistic requirements and raises the question of a missing stability condition for using Bayesian Inference in a continuous world under finite information".
Frequentism doesn't get away cleanly, though. Steegen et al.'s Increasing Transparency Through a Multiverse Analysis makes the point that, in actual data analysis, the probabilistic models you choose and the statistical tests you run do depend on the specific outcomes you got, e.g. when you do exploratory data analysis, or really when you take any data processing steps with degrees of freedom (such as categorizing data, combining variables, transforming variables, excluding some data, and so on), and that this does create bias. They propose "multiverse analyses", where you consider the range of possible results you could've gotten if you had done things differently, scanning this range to see where the most important choices lie and what their presence suggests about your conclusion.
				
We use probability to reason about things we wish to influence in the future, as well as about things that have influenced us in the past. The general solution to this might seem simple enough—plug your data and your uncertainty (prior) into your model, apply Bayes, and where applicable take the action that maximizes expected utility w.r.t. your new posterior. But the elegance of this solution is only made possible by a fallacious agent-environment distinction.
Scott Garrabrant makes the point in Bayesian Probability is for things that are Space-like Separated from You. The term comes from special relativity, where space-like separation refers to things outside your lightcone.
					
The first point is in fact an incredible problem for Bayesian updating in adversarial environments, discussed more thoroughly in Abram Demski's Thinking About Filtered Evidence is (Very!) Hard.
					
Yet this is dangerous for an entirely different reason. While Demski's Embedded Agency is beyond our scope, his The Parable of Predict-O-Matic hints at why we might not want to build any agent whose probabilistic model lets it predict how its actions affect its observations. Insofar as that agent has a utility which depends on its observations, such as making predictions which minimize the error of those observations, it has not just the motive but the ability to figure out how to act in order to increase the utility of what it observes. Even if you just want it to predict things for you, stating predictions is an action, and it can benefit by tailoring or conditionally sharing its predictions in order to manipulate you.
Of course, there are good reasons to think that this is a capacity the agent should have: you want your construct to be able to do the optimal thing to achieve its goal, and then control what it decides to do in order to keep it from subverting or destroying you. If it can't solve this problem, known generally as the naturalized induction problem, it will always be artificially stupid (unless it somehow manages to develop this notion on its own, in the same way that humans sometimes do). There are some proposed solutions to this problem, such as infra-Bayesian physicalism.
					
These are some extraordinarily general limitations to the use of Bayesian probabilities. You can say that this or that coherence theorem is why Bayesian agents are the most accurate—but when it turns out that they just aren't, when they're intractably screwed over by such simple twists that humans readily reason about, what are all your proofs, all your theorems, and all your mathematical guarantees even doing?
In any case, there's another sense in which you might use probability to reason about "your world", one in which the nature of your existence in it is even more fundamental. Anthropics.
As painful as it is, anthropic reasoning must inform the way we use probabilistic reasoning. When we say that no, the fine-tuning of the universe to support life doesn't imply a fine-tuner, because it's a necessary condition of our being here to observe anything at all, we're using anthropics: arguing that even though $P(\text{life}\mid \text{random tuning})$ might be very low, we're not looking down at an arbitrary randomly chosen universe; we're within a universe which must be able to support us.
Either way, radical Bayesians have a problem: if they ignore anthropics, they're screwed because they're incoherently settling a large class of important questions by ignorant dogmatism. If they don't ignore anthropics, they're screwed because now they have to think about anthropics.
Stuart Armstrong's Anthropics: different probabilities, different questions shows how different theories of anthropic probability are really answers to different questions, and ata's If a tree falls on Sleeping Beauty... discusses how this obtains in the case of the Sleeping Beauty problem: how those who get an answer of 1/3 and those who get 1/2 are answering different questions.
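The "different questions" point can be seen in a short simulation. This sketch assumes the usual setup (a fair coin; heads means one awakening, tails means two) and simply counts heads in two different ways; the function name is mine.

```python
import random


def sleeping_beauty(n_experiments=100_000, seed=0):
    """Return (heads per experiment, heads per awakening)."""
    rng = random.Random(seed)
    heads_experiments = 0
    heads_awakenings = 0
    total_awakenings = 0
    for _ in range(n_experiments):
        heads = rng.random() < 0.5
        awakenings = 1 if heads else 2  # tails: Beauty is woken twice
        total_awakenings += awakenings
        if heads:
            heads_experiments += 1
            heads_awakenings += 1
    return (
        heads_experiments / n_experiments,
        heads_awakenings / total_awakenings,
    )


if __name__ == "__main__":
    per_experiment, per_awakening = sleeping_beauty()
    print(per_experiment)  # close to 1/2
    print(per_awakening)   # close to 1/3
```

Both numbers are correct frequencies. "In what fraction of experiments does the coin land heads?" and "In what fraction of awakenings did the coin land heads?" are simply different questions, and the simulation cannot tell you which one Beauty's credence is supposed to answer.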
We want to use probability to reason about what things are true. The theory through which frequentists come to conclusions about what is true looks more like statistics proper, which I won't go into—though, again, see Statistics is Difficult, which points to the unusually interesting divide between Fisher and Neyman-Pearson on the extent to which statistical inference ought to help guide our interpretations of evidence (as opposed to directly guiding our decisions).
For Bayesians, the situation is thornier. Aumann's Agreement Theorem tells us that agents with the same prior whose posteriors are common knowledge must hold the same posterior, rather than agreeing to disagree, but this never happens. We are not rational enough, quantitatively precise enough, computationally powerful enough to implement it. In this sense, Robin Hanson calls us "Bayesian wannabes", and, in his paper For Bayesian Wannabes, Are Disagreements Not About Information?, demonstrates that such wannabes with the same starting priors will disagree for reasons beyond their simply having different information about the world.
Andrew Gelman, who literally wrote the book on Bayesian Data Analysis, gives two more pragmatic objections to the use of Bayesian inference in his Objections to Bayesian Statistics. First, that Bayesian methods are generally presented as "automatic", when that's just not how statistical modeling tends to work (it's a setting-dependent thing, where models are used highly contextually), and, second, that it's not clear how to assess the kind of subjective knowledge Bayesian probability claims to be about, and science should be concerned with objective knowledge anyway. There's a great paragraph that I'll quote:
As Brad Efron wrote in 1986, Bayesian theory requires a great deal of thought about the given situation to apply sensibly, and recommending that scientists use Bayes’ theorem is like giving the neighborhood kids the key to your F-16. I’d rather start with tried and true methods, and then generalize using something I can trust, such as statistical theory and minimax principles, that don’t depend on your subjective beliefs. Especially when the priors I see in practice are typically just convenient conjugate forms. What a coincidence that, of all the infinite variety of priors that could be chosen, it always seems to be the normal, gamma, beta, etc., that turn out to be the right choices?
Abram Demski's Complete Class: Consequentialist Foundations offers a "more purely consequentialist foundation for decision theory", and a "proposed foundational argument for Bayesianism". He argues that Dutch books are better understood as ways of illustrating inconsistencies than as demonstrations of desiderata for rationality, because beliefs are distinct from actions like betting, and a decision theory is required to link them. Instead, Demski advocates the titular complete class theorem, which states that any decision rule that's Pareto-optimal (or admissible: no other decision rule does at least as well in every state and strictly better in some) can be expressed as Bayesian, namely as an expected utility maximizer for some prior. As Jessica Taylor's comment points out, though, the "some" in that sentence is doing a lot of work: there are ways for groups to arrive at Pareto-optimal outcomes without utilizing Bayesian methods, and which only happen to be rationalizable in those terms. The prior guaranteed by the theorem might end up looking absolutely insane.
As regards the role of Bayesian inference in decision theories, see also Wei Dai's Why (and why not) Bayesian Updating?, and the paper it discusses, Paolo Ghirardato's Revisiting Savage in a conditional world, which gives a list of seven axioms that, taken together, are "necessary and sufficient for an agent's preferences in a dynamic decision problem to be represented as expected utility maximization with Bayesian belief updating" (quoting the former).
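To see the shape of the complete class claim, here is a toy sketch under heavy assumptions of my own: two world-states, four pure actions with made-up losses, and no randomized rules. The real theorem concerns decision rules in general and comes with regularity conditions that this toy ignores. It checks that every undominated action is the expected-loss minimizer for some prior, and that the dominated one never is.

```python
# Hypothetical losses: LOSSES[action] = (loss in state 0, loss in state 1).
LOSSES = {
    "A": (0.0, 10.0),
    "B": (10.0, 0.0),
    "C": (4.0, 4.0),
    "D": (6.0, 6.0),  # worse than C in both states
}


def dominates(x, y):
    """True if x is at least as good as y everywhere and better somewhere."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def admissible(losses):
    return {
        name
        for name, loss in losses.items()
        if not any(dominates(other, loss) for o, other in losses.items() if o != name)
    }


def bayes_actions(losses, p_state0):
    """Actions minimizing expected loss under the prior (p_state0, 1 - p_state0)."""
    expected = {
        name: p_state0 * loss[0] + (1.0 - p_state0) * loss[1]
        for name, loss in losses.items()
    }
    best = min(expected.values())
    return {name for name, value in expected.items() if value <= best + 1e-12}


def rationalizing_priors(losses, grid=101):
    """For each action that is ever Bayes-optimal, one prior that makes it so."""
    found = {}
    for i in range(grid):
        p = i / (grid - 1)
        for name in bayes_actions(losses, p):
            found.setdefault(name, p)
    return found


if __name__ == "__main__":
    print(sorted(admissible(LOSSES)))      # ['A', 'B', 'C']
    print(rationalizing_priors(LOSSES))    # a prior for each of A, B, C; none for D
```

Notice that the search only establishes that some prior exists for each admissible action. It says nothing about whether that prior is one anybody would actually hold, which is the gap Taylor's comment points at.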
There are lots of things that I haven't covered here which would be essential to a fuller treatment of the subject. If you want to truly understand how probability is difficult, some other subject-clusters to explore include:
To summarize: probability is difficult. It's important to learn where and when it breaks down, so that you don't make great mistakes in placing so much weight upon such weak foundations—or, worse, build an agent that systematically does so without the ability to ever tell that it's "not supposed" to be doing something pathological.