57  Probability theory

Quantifying uncertainty

A component comes off a production line. You don’t know whether it’s defective. You test 1000 and 23 fail. What can you say about the next one? About the next batch of 500?

These questions have precise answers. Getting them requires a framework for describing uncertainty — assigning likelihoods to outcomes, combining them consistently, and extracting the quantities that are actually useful.

57.1 Probability axioms

57.1.1 Sample space and events

The sample space \(\Omega\) (read: “omega”) is the complete set of possible outcomes of an experiment. For a coin flip, \(\Omega = \{\text{heads}, \text{tails}\}\). For rolling a six-sided die, \(\Omega = \{1, 2, 3, 4, 5, 6\}\). For the lifetime of a component in hours, \(\Omega = [0, \infty)\).

An event \(A\) is any subset of \(\Omega\) — a collection of outcomes you care about. The event “roll an even number” is the subset \(\{2, 4, 6\} \subset \Omega\).

57.1.2 Kolmogorov axioms

A probability measure \(P\) assigns a number to each event. Three axioms define what that assignment must satisfy:

  1. Non-negativity: \(P(A) \geq 0\) for every event \(A\).
  2. Normalisation: \(P(\Omega) = 1\) — something must happen.
  3. Additivity: For mutually exclusive events \(A\) and \(B\) (events that cannot both occur), \(P(A \cup B) = P(A) + P(B)\).

These axioms are the whole foundation. Everything else is a consequence.

From them you can show:

  • \(P(\emptyset) = 0\) (the impossible event has probability zero)
  • \(P(A^c) = 1 - P(A)\), where \(A^c\) is the complement of \(A\) (everything not in \(A\))
  • If \(A \subset B\) then \(P(A) \leq P(B)\)
  • The inclusion-exclusion rule: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)

The third axiom extends to any countable collection of mutually exclusive events: \(P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots\)
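These consequences can be checked mechanically by enumerating a finite sample space. A short Python sketch (illustrative only — the die and the events are arbitrary choices, not part of the formal development):

```python
from fractions import Fraction

# Sample space for a fair six-sided die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # roll an even number
B = {4, 5, 6}   # roll at least 4

assert P(set()) == 0                           # impossible event
assert P(omega - A) == 1 - P(A)                # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)      # inclusion-exclusion
assert P(A) <= P(A | B)                        # monotonicity: A is a subset of A ∪ B
```

Using `Fraction` keeps every probability exact, so the identities hold with `==` rather than within floating-point tolerance.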

57.1.3 Conditional probability

The conditional probability of \(A\) given \(B\) — written \(P(A \mid B)\), read “probability of \(A\) given \(B\)” — is defined as:

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0\]

This is the fraction of \(B\)’s probability that overlaps with \(A\). If you know \(B\) has occurred, you’re restricting attention to the outcomes in \(B\) and asking what fraction of those are also in \(A\).

Two events \(A\) and \(B\) are independent if knowing \(B\) occurred tells you nothing about \(A\):

\[P(A \mid B) = P(A) \iff P(A \cap B) = P(A) \cdot P(B)\]

(The equivalence assumes \(P(B) > 0\); the product form on the right is taken as the general definition, since it also covers \(P(B) = 0\).)
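Conditional probability and independence can likewise be verified by counting. A brief Python sketch (the die and events are an arbitrary illustration):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event), len(omega))

def cond(A, B):
    """P(A | B): the fraction of B's probability that overlaps with A."""
    return P(A & B) / P(B)

A = {2, 4, 6}       # even
B = {1, 2, 3, 4}    # at most 4

# P(A | B) = |{2, 4}| / |{1, 2, 3, 4}| = 1/2 = P(A): these events are independent
assert cond(A, B) == P(A)
assert P(A & B) == P(A) * P(B)
```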

57.1.4 Bayes’ theorem

Rearranging the definition of conditional probability:

\[P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)\]

Solving for \(P(A \mid B)\):

\[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}\]

This is Bayes’ theorem. It lets you reverse the conditioning: if you know \(P(B \mid A)\), you can compute \(P(A \mid B)\).

The denominator \(P(B)\) is expanded using the total probability rule. If \(A_1, A_2, \ldots, A_n\) partition \(\Omega\) (exhaustive, mutually exclusive), then:

\[P(B) = \sum_{i=1}^{n} P(B \mid A_i) \cdot P(A_i)\]

Why this works

Bayes’ theorem doesn’t involve any new mathematics — it’s just two ways of writing the same joint probability \(P(A \cap B)\), set equal. What makes it powerful is the direction: you have the likelihood \(P(B \mid A)\) (how probable is the evidence given the hypothesis?) and you want \(P(A \mid B)\) (how probable is the hypothesis given the evidence?). Bayes is the bridge between them.

Worked example: Medical test. A disease affects 1% of the population. A test for it has sensitivity 99% (correctly identifies 99% of sick patients) and specificity 95% (correctly clears 95% of healthy patients). A randomly selected person tests positive. What is the probability they actually have the disease?

Define:

  • \(D\): the person has the disease
  • \(T^+\): the test is positive

Given: \(P(D) = 0.01\), \(P(T^+ \mid D) = 0.99\) (sensitivity), \(P(T^+ \mid D^c) = 0.05\)

Note also that \(P(D^c) = 1 - P(D) = 1 - 0.01 = 0.99\) — numerically equal to the sensitivity, but only as a coincidence of the chosen numbers; the two quantities are unrelated.

First compute the total probability of a positive test, using the two ways a positive result can occur:

\[P(T^+) = P(T^+ \mid D) \cdot P(D) + P(T^+ \mid D^c) \cdot P(D^c)\] \[= \underbrace{0.99}_{\text{sensitivity}} \times 0.01 + 0.05 \times \underbrace{0.99}_{P(D^c) = 1 - 0.01} = 0.0099 + 0.0495 = 0.0594\]

Now apply Bayes:

\[P(D \mid T^+) = \frac{P(T^+ \mid D) \cdot P(D)}{P(T^+)} = \frac{0.99 \times 0.01}{0.0594} \approx 0.167\]

Even with a positive result from a 99%-sensitive test, the probability of actually having the disease is only about 17%. This is not a failure of the test — it’s a consequence of the disease being rare. Most of the positive tests come from the large healthy population, not the small sick one. The prior probability \(P(D)\) matters enormously.
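The worked example translates directly into a few lines of Python — a sketch of the arithmetic above, not a library routine:

```python
p_d = 0.01      # prevalence P(D)
sens = 0.99     # sensitivity P(T+ | D)
fpr = 0.05      # false-positive rate P(T+ | D^c) = 1 - specificity

# Total probability of a positive test: sum over the two ways it can occur
p_pos = sens * p_d + fpr * (1 - p_d)

# Bayes' theorem: reverse the conditioning
posterior = sens * p_d / p_pos

assert abs(p_pos - 0.0594) < 1e-12
assert abs(posterior - 0.1667) < 5e-4    # about 17%, despite the 99% sensitivity
```

The structure mirrors the derivation exactly: one line for the total probability rule, one for Bayes.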

57.2 Random variables

A random variable \(X\) is a function from the sample space \(\Omega\) to the real numbers — it assigns a numerical value to each outcome. The randomness is in the outcome; the function is deterministic.

Discrete random variables take values in a countable set \(\{x_1, x_2, \ldots\}\). The distribution is completely described by the probability mass function (PMF):

\[p(x_i) = P(X = x_i), \quad \sum_i p(x_i) = 1\]

Continuous random variables take values in an interval (or union of intervals). The distribution is described by the probability density function (PDF) \(f(x)\), where:

\[P(a \leq X \leq b) = \int_a^b f(x)\, dx, \quad \int_{-\infty}^{\infty} f(x)\, dx = 1\]

Note that \(f(x)\) is not a probability — it can exceed 1. It’s a density: the probability of \(X\) falling in a small interval \([x, x + dx]\) is approximately \(f(x)\, dx\).

A concrete example: if \(X \sim \text{Uniform}(0,\, 0.5)\), then \(f(x) = 2\) everywhere on \([0, 0.5]\). The density is 2, yet \(P(0 \leq X \leq 0.5) = \int_0^{0.5} 2\, dx = 1\). The density can exceed 1; the integral over any interval is still between 0 and 1.

The cumulative distribution function (CDF) is defined for both types:

\[F(x) = P(X \leq x)\]

For continuous \(X\): \(F(x) = \int_{-\infty}^{x} f(t)\, dt\), and \(f(x) = F'(x)\).

57.2.1 Expectation

The expected value \(E[X]\) — also written \(\mu\) (mu) — is the long-run average of \(X\) over many repetitions:

Discrete: \(\displaystyle E[X] = \sum_i x_i \, p(x_i)\)

Continuous: \(\displaystyle E[X] = \int_{-\infty}^{\infty} x \, f(x)\, dx\)

Expectation is linear: \(E[aX + b] = a\,E[X] + b\) for constants \(a\), \(b\).

For a function \(g(X)\):

\[E[g(X)] = \sum_i g(x_i)\, p(x_i) \quad \text{(discrete)}\]

\[E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \quad \text{(continuous)}\]

57.2.2 Variance

The variance \(\text{Var}(X)\) — also written \(\sigma^2\) (sigma squared) — measures the average squared deviation from the mean:

\[\text{Var}(X) = E\!\left[(X - \mu)^2\right] = E[X^2] - (E[X])^2\]

The second form is usually easier to compute. The derivation:

\[E\!\left[(X - \mu)^2\right] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu\,E[X] + \mu^2\]

Since \(E[X] = \mu\), the last two terms combine: \(-2\mu \cdot \mu + \mu^2 = -2\mu^2 + \mu^2 = -\mu^2\). So:

\[= E[X^2] - \mu^2\]

The standard deviation \(\sigma = \sqrt{\text{Var}(X)}\) is in the same units as \(X\), which makes it interpretable.

Variance scales with the square: \(\text{Var}(aX + b) = a^2\,\text{Var}(X)\).

For independent random variables \(X\) and \(Y\): \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\).
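Both variance identities can be verified exactly for a fair die. A Python sketch using exact rational arithmetic (the die is an arbitrary example):

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

mu  = sum(x * p for x in faces)          # E[X] = 7/2
ex2 = sum(x * x * p for x in faces)      # E[X^2] = 91/6
var = ex2 - mu**2                        # shortcut formula E[X^2] - mu^2

assert mu == Fraction(7, 2)
assert var == Fraction(35, 12)

# Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a enters squared
a, b = 3, 10
mu_ab  = sum((a * x + b) * p for x in faces)
var_ab = sum((a * x + b) ** 2 * p for x in faces) - mu_ab**2
assert var_ab == a**2 * var
```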

57.3 Key discrete distributions

57.3.1 Bernoulli

A single trial with probability \(p\) of success, \(1-p\) of failure.

\[P(X = 1) = p, \quad P(X = 0) = 1 - p\]

\[E[X] = p, \quad \text{Var}(X) = p(1-p)\]

The Bernoulli is the building block for everything that follows.

57.3.2 Binomial

\(n\) independent Bernoulli trials, each with success probability \(p\). \(X\) counts the total number of successes.

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n\]

The binomial coefficient \(\binom{n}{k}\) — read “n choose k” — counts the number of ways to arrange \(k\) successes among \(n\) trials:

\[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]

Mean and variance. Since \(X\) is the sum of \(n\) independent Bernoulli trials, linearity of expectation and variance additivity give:

\[E[X] = np, \quad \text{Var}(X) = np(1-p)\]
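These formulas are easy to confirm numerically from the PMF. A Python sketch (the values \(n = 8\), \(p = 0.1\) are arbitrary):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 8, 0.1
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

assert abs(sum(pmf) - 1) < 1e-12                 # PMF sums to 1

mean = sum(k * q for k, q in enumerate(pmf))
var  = sum(k * k * q for k, q in enumerate(pmf)) - mean**2

assert abs(mean - n * p) < 1e-12                 # E[X] = np
assert abs(var - n * p * (1 - p)) < 1e-9         # Var(X) = np(1-p)
```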

Normal approximation. When \(np \geq 5\) and \(n(1-p) \geq 5\), the binomial is well approximated by a normal distribution with the same mean and variance. This is one preview of the CLT.

57.3.3 Poisson

Models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant average rate.

\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots\]

The parameter \(\lambda > 0\) — read “lambda” — is both the mean and the variance:

\[E[X] = \lambda, \quad \text{Var}(X) = \lambda\]

As a limit of Binomial. Set \(\lambda = np\) and let \(n \to \infty\), \(p \to 0\). The binomial PMF converges to the Poisson. This is why the Poisson appears when events are rare but trials are many: the number of typing errors per page, radioactive decays per second, server requests per minute.

Derivation of the mean. Using the Poisson PMF:

\[E[X] = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} = e^{-\lambda} \cdot \lambda \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = e^{-\lambda} \cdot \lambda \cdot e^{\lambda} = \lambda\]
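The binomial-to-Poisson limit claimed above can be observed numerically. A Python sketch (\(k = 3\) and \(\lambda = 2\) are arbitrary choices):

```python
from math import comb, exp, factorial

lam = 2.0

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Binomial(n, lam/n) PMF at k = 3, for growing n, compared to Poisson(lam)
errs = []
for n in (10, 100, 10_000):
    p = lam / n
    binom_at_3 = comb(n, 3) * p**3 * (1 - p)**(n - 3)
    errs.append(abs(binom_at_3 - poisson_pmf(3, lam)))

# The discrepancy shrinks as n grows and p = lam/n shrinks
assert errs[0] > errs[1] > errs[2]
assert errs[2] < 1e-4
```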

57.4 Key continuous distributions

57.4.1 Uniform

\(X \sim \text{Uniform}(a, b)\) — every value in \([a,b]\) is equally likely.

\[f(x) = \frac{1}{b-a}, \quad a \leq x \leq b\]

\[E[X] = \frac{a+b}{2}, \quad \text{Var}(X) = \frac{(b-a)^2}{12}\]

The CDF is: \(F(x) = (x-a)/(b-a)\) for \(a \leq x \leq b\).
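A quick numerical cross-check of the mean and variance formulas, using a midpoint-rule integral (the endpoints \(a = 2\), \(b = 5\) are arbitrary):

```python
# Check Uniform(a, b) mean and variance by numerical integration (midpoint rule)
a, b = 2.0, 5.0
N = 100_000
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]
f = 1.0 / (b - a)                          # constant density on [a, b]

mean = sum(x * f * dx for x in xs)         # E[X] = integral of x f(x) dx
var  = sum(x * x * f * dx for x in xs) - mean**2

assert abs(mean - (a + b) / 2) < 1e-6      # (a + b) / 2 = 3.5
assert abs(var - (b - a)**2 / 12) < 1e-6   # (b - a)^2 / 12 = 0.75
```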

57.4.2 Exponential

Models the time until the first event in a Poisson process — waiting times, component lifetimes, time between arrivals.

\[f(x) = \lambda e^{-\lambda x}, \quad x \geq 0\]

\[F(x) = 1 - e^{-\lambda x}\]

Derivation of mean. Integrate by parts:

\[E[X] = \int_0^{\infty} x \cdot \lambda e^{-\lambda x}\, dx\]

Let \(u = x\), \(dv = \lambda e^{-\lambda x}\, dx\). Then \(du = dx\), \(v = -e^{-\lambda x}\).

\[E[X] = \left[-x e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx = 0 + \left[-\frac{1}{\lambda} e^{-\lambda x}\right]_0^{\infty} = \frac{1}{\lambda}\]

Derivation of variance. First compute \(E[X^2]\):

\[E[X^2] = \int_0^{\infty} x^2 \cdot \lambda e^{-\lambda x}\, dx = \frac{2}{\lambda^2}\]

The integration follows the same pattern as the derivation of \(E[X]\) above — integration by parts applied twice, or equivalently by recognising the integral as a gamma function. Either route gives \(2/\lambda^2\).

\[\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}\]

Memoryless property. The exponential is the only continuous distribution with no memory:

\[P(X > s + t \mid X > s) = P(X > t) \quad \text{for all } s, t \geq 0\]

If the component has already survived \(s\) hours, the probability it survives another \(t\) hours is the same as if it were brand new. The past waiting time gives no information about the future.

Proof. Using the survival function \(P(X > x) = e^{-\lambda x}\):

\[P(X > s+t \mid X > s) = \frac{P(X > s+t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t) \quad \checkmark\]
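The proof reduces to a two-line numerical check via the survival function. A Python sketch (the values of \(\lambda\), \(s\), \(t\) are arbitrary):

```python
from math import exp

lam = 0.5
def surv(x):
    """Survival function P(X > x) for X ~ Exp(lam)."""
    return exp(-lam * x)

s, t = 3.0, 2.0
lhs = surv(s + t) / surv(s)    # P(X > s+t | X > s), by the conditional probability rule
rhs = surv(t)                  # P(X > t)

assert abs(lhs - rhs) < 1e-12  # memoryless: surviving s hours resets nothing
```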

57.4.3 Normal

\(X \sim N(\mu, \sigma^2)\) — the most important distribution in applied probability, for reasons the CLT will make clear.

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

Parameters: \(\mu\) is the mean, \(\sigma^2\) is the variance.

\[E[X] = \mu, \quad \text{Var}(X) = \sigma^2\]

There is no closed form for the CDF — it’s evaluated numerically and tabulated as the standard normal CDF \(\Phi(z)\), where \(\Phi\) (capital phi) is the CDF of \(Z \sim N(0,1)\).

Standardisation. Any normal \(X \sim N(\mu, \sigma^2)\) can be converted to the standard normal by:

\[Z = \frac{X - \mu}{\sigma}\]

Then \(P(X \leq x) = P\!\left(Z \leq \frac{x-\mu}{\sigma}\right) = \Phi\!\left(\frac{x-\mu}{\sigma}\right)\).

The transformation \(Z = (X - \mu)/\sigma\) — “subtract the mean, divide by the standard deviation” — centres the distribution at zero and scales it to unit variance. Every probability question about \(X \sim N(\mu, \sigma^2)\) reduces to looking up a value in the standard normal table.
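In code, \(\Phi\) is available through the error function, so no table lookup is needed. A Python sketch (the parameters \(\mu = 5\), \(\sigma = 2\) are arbitrary):

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 5.0, 2.0            # X ~ N(5, 4), arbitrary example parameters
x = 7.0

z = (x - mu) / sigma            # standardise: subtract the mean, divide by sigma
p = phi(z)                      # P(X <= 7) = Phi(1)

assert abs(p - 0.8413) < 5e-4   # Phi(1) ~ 0.8413, matching the standard normal table
assert abs(phi(0.0) - 0.5) < 1e-12
```

The identity \(\Phi(z) = \tfrac{1}{2}(1 + \operatorname{erf}(z/\sqrt{2}))\) is exact, so this is how statistical libraries typically evaluate the normal CDF internally.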

57.5 Joint distributions

For two random variables \(X\) and \(Y\), the joint distribution describes their behaviour together. The key concept for most applications is independence.

\(X\) and \(Y\) are independent if knowing the value of one gives no information about the other:

\[f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)\]

That is, the joint density factors into the product of the marginals.

Covariance. A measure of how \(X\) and \(Y\) move together:

\[\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]\,E[Y]\]

If \(X\) and \(Y\) are independent, \(E[XY] = E[X]\,E[Y]\), so \(\text{Cov}(X,Y) = 0\). (The converse is not generally true: zero covariance does not imply independence.)

Correlation coefficient. Covariance is scale-dependent — multiplying \(X\) by 2 doubles the covariance. The correlation normalises this:

\[\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}, \quad -1 \leq \rho \leq 1\]

\(\rho = 1\) means perfect positive linear relationship; \(\rho = -1\) means perfect negative; \(\rho = 0\) means no linear relationship.

Variance of a sum. For any \(X\) and \(Y\) (not necessarily independent):

\[\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X,Y)\]

If they are independent, the covariance term vanishes.
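The variance-of-a-sum identity can be verified exactly on a small joint distribution. A Python sketch (the two-dice construction is an arbitrary example of dependent variables):

```python
from itertools import product
from fractions import Fraction

# Joint distribution: X = first die, Y = X + second die. X and Y are dependent.
outcomes = [(a, a + b) for a, b in product(range(1, 7), repeat=2)]
p = Fraction(1, 36)

def E(f):
    """Expectation of f(X, Y) over the joint distribution."""
    return sum(f(x, y) * p for x, y in outcomes)

mx, my = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: x * x) - mx**2
var_y = E(lambda x, y: y * y) - my**2
cov   = E(lambda x, y: x * y) - mx * my
var_sum = E(lambda x, y: (x + y) ** 2) - (mx + my) ** 2

assert var_sum == var_x + var_y + 2 * cov
# Cov(X, X + D2) = Var(X) = 35/12, since X and the second die are independent
assert cov == Fraction(35, 12)
```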

57.6 Central Limit Theorem

Let \(X_1, X_2, \ldots, X_n\) be independent, identically distributed random variables with mean \(\mu\) and variance \(\sigma^2 < \infty\). Define the sample mean:

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\]

Central Limit Theorem (CLT): As \(n \to \infty\),

\[\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)\]

Read \(\xrightarrow{d}\) as “converges in distribution to” — as \(n\) grows, the distribution of the standardised mean gets closer and closer to the standard normal.

In practice: for large \(n\), the distribution of \(\bar{X}_n\) is approximately \(N(\mu,\, \sigma^2/n)\).

Three things the CLT is saying:

  1. The sample mean converges to the true mean \(\mu\) — not just as a hope, but with a rate: the spread is \(\sigma/\sqrt{n}\), shrinking like \(1/\sqrt{n}\).

  2. The limiting distribution is always normal, regardless of the distribution of the individual \(X_i\). You do not need \(X_i\) to be normal — exponential, uniform, Bernoulli, anything — the average becomes normal.

  3. The only requirements are: independent, identically distributed (abbreviated i.i.d.), finite variance. No further conditions.

Conditions for practical use. The approximation is usually adequate when \(n \geq 30\), and excellent for \(n \geq 50\). For distributions that are already close to normal, smaller \(n\) suffices. For strongly skewed or heavy-tailed distributions, larger \(n\) may be needed.

Worked example: Sample mean from an exponential population.

Components have lifetimes \(X_i \sim \text{Exp}(\lambda = 0.5)\), so \(\mu = 1/\lambda = 2\) hours and \(\sigma^2 = 1/\lambda^2 = 4\).

Take a sample of \(n = 50\) components. By the CLT, the sample mean is approximately:

\[\bar{X}_{50} \approx N\!\left(2,\, \frac{4}{50}\right) = N(2,\, 0.08)\]

Standard deviation of the mean: \(\sigma/\sqrt{n} = 2/\sqrt{50} \approx 0.283\).

Find \(P(\bar{X}_{50} > 2.4)\):

\[P(\bar{X}_{50} > 2.4) = P\!\left(Z > \frac{2.4 - 2}{0.283}\right) = P(Z > 1.414) = 1 - \Phi(1.414) \approx 0.079\]

There is about an 8% chance the sample mean exceeds 2.4 hours. The individual lifetimes are exponential and highly right-skewed — but the average over 50 of them behaves almost exactly like a normal random variable.
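The worked example can be cross-checked by simulation. A Python sketch (the sample counts and seed are arbitrary; the results are Monte Carlo estimates, not exact values):

```python
import random
from math import erf, sqrt

random.seed(42)
lam, n, trials = 0.5, 50, 20_000

# Empirical distribution of the sample mean of 50 Exp(0.5) lifetimes
means = [sum(random.expovariate(lam) for _ in range(n)) / n
         for _ in range(trials)]
frac_above = sum(m > 2.4 for m in means) / trials

# CLT prediction: P(mean > 2.4) = 1 - Phi((2.4 - 2) / (2 / sqrt(50)))
z = (2.4 - 2.0) / (2.0 / sqrt(n))
clt = 1 - 0.5 * (1 + erf(z / sqrt(2)))

# Both come out near 0.08: the skewed exponentials average to near-normal
assert abs(frac_above - clt) < 0.02
```

The small residual gap between the simulated fraction and the CLT figure is the remaining skewness of the exponential at \(n = 50\); it shrinks further as \(n\) grows.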

57.7 Where this goes

This chapter provides the foundation for Chapter 2: Mathematical statistics (this volume). Everything in that chapter — maximum likelihood estimation, confidence intervals, hypothesis tests — takes the distributions developed here and uses them to make inferences about unknown parameters from data. The CLT is the engine: it justifies using normal-distribution machinery on sample means regardless of the underlying population distribution, which is why \(z\)-tests and \(t\)-tests work in practice.

The connections extend further into other sections of Vol 7. The Poisson distribution arises in queuing theory (a special case of stochastic processes). The normal distribution underlies error analysis in numerical methods: when you propagate measurement uncertainties through a computation, the CLT explains why the output errors are approximately normal. In engineering statistics, the same distributions appear in reliability theory, quality control (Six Sigma thresholds are expressed in \(\sigma\) units), and signal detection.

Where this shows up

  • A reliability engineer models component lifetimes as exponential and uses the memoryless property to compute replacement schedules.
  • An actuary prices insurance using the Poisson distribution to model the number of claims per period.
  • A machine learning engineer interprets model output probabilities using the Bayes framework — the model’s output is \(P(Y \mid X)\), not \(P(X \mid Y)\).
  • A signal processing engineer applies the CLT to justify that thermal noise in electronic circuits is modelled as Gaussian.
  • A quality control engineer uses the binomial distribution to decide whether a batch rejection threshold is appropriate for a given defect rate.

57.8 Exercises

These are puzzles. Each has a clean numerical answer. The interesting part is identifying which distribution applies and setting up the probability correctly.


Exercise 1. A factory has two machines. Machine A produces 60% of output and has a 4% defect rate. Machine B produces the remaining 40% and has a 1% defect rate. An inspector picks a component at random and finds it defective. What is the probability it came from Machine A?


Exercise 2. A quality control test checks batches of 8 components. Each component, independently, has a failure probability of 0.1. Find the probability that at least 3 components in a batch fail.


Exercise 3. A call centre receives calls at an average rate of 4 per minute. In a 30-second window, what is the probability of receiving 3 or more calls?


Exercise 4. The time to failure of a device follows an exponential distribution with rate \(\lambda = 0.02\) per hour. Find: (a) \(P(T > 100)\), (b) \(E[T]\), (c) \(\text{Var}(T)\).


Exercise 5. A manufacturing process produces items whose length \(X \sim N(75, 100)\) mm (mean 75 mm, variance 100 mm²). The specification requires length greater than 85 mm. Find \(P(X > 85)\).


Exercise 6. Let \(X_1, X_2, \ldots, X_{50}\) be i.i.d. \(\text{Exp}(1)\) random variables. Their mean is \(\mu = 1\) and variance is \(\sigma^2 = 1\). Use the CLT to approximate \(P(\bar{X}_{50} > 1.2)\).