33  Probability distributions

How probability spreads across possible values

You flip a fair coin 10 times. Before you do it, you know 5 heads seems most likely — but what exactly is the probability of getting 5? What about 8? And should 8 heads make you suspicious, or does it happen often enough that you’d expect it occasionally? Without knowing the full shape of the probability — which outcomes get what share — you can’t answer any of those questions.

Here is a different situation. You’re standing in a school gymnasium with every 17-year-old in the district. Their heights spread out around some average. A few people are very tall, a few very short, most cluster near the middle. That familiar bell shape is not a coincidence — it emerges every time many small independent factors (genetics, nutrition, sleep, dozens of other things) pile on top of each other.

One more. A website gets an average of 100 visitors per hour. One night at 2am, the server logs show 140. Is that suspicious — a bot attack, maybe — or just the kind of natural swing you’d expect? You need to know how much variation is normal before you can say whether 140 is alarming.

All three situations share the same underlying need: a complete picture of how probability is spread across every possible value. That picture is a probability distribution.

33.1 What the notation is saying

33.1.1 Random variables

A random variable is a quantity whose value is determined by a random process. We write it with a capital letter — usually \(X\).

  • \(X\) = the number of heads in 10 coin flips. \(X\) could be 0, 1, 2, …, 10.
  • \(X\) = the height of a randomly chosen 17-year-old. \(X\) could be any value in some continuous range.

The notation \(P(X = k)\) means: the probability that the random variable \(X\) takes the specific value \(k\).

For 10 coin flips, \(P(X = 5)\) is the probability of getting exactly 5 heads. You’ll calculate this shortly.

One constraint: all the probabilities must add up to 1. If you list every value \(X\) could possibly take, the probabilities must account for the whole picture.

We’ll write \(\sum\) (capital sigma, the Greek letter S) to mean “add up all the terms that follow — here, over every possible value \(k\)”:

\[\sum_{\text{all } k} P(X = k) = 1\]

This is just saying that something must happen.

33.1.2 Mean, variance, and standard deviation

Once you have a distribution, you can summarise it with two numbers.

The mean (also called the expected value) tells you the long-run average — what value you’d get if you ran the random process over and over and averaged the results. Written \(\mu\) (the Greek letter mu):

\[\mu = \sum_k k \cdot P(X = k)\]

Read this as: multiply each possible value by its probability, then add everything up. It is a probability-weighted average.
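Both ideas — probabilities summing to 1, and the mean as a probability-weighted average — are easy to check by hand for a fair die. A minimal Python sketch (our own example, not from the chapter):

```python
from fractions import Fraction

# A fair six-sided die: every face gets probability 1/6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# The probabilities must account for the whole picture: they sum to 1.
total = sum(pmf.values())

# The mean: multiply each value by its probability, then add everything up.
mean = sum(k * p for k, p in pmf.items())

print(total)  # 1
print(mean)   # 7/2, i.e. 3.5
```

Using `Fraction` keeps the arithmetic exact, so the constraint \(\sum_k P(X = k) = 1\) holds with no rounding error.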

The variance \(\sigma^2\) (sigma-squared) measures how spread out the distribution is — how far values tend to stray from the mean on average. The standard deviation \(\sigma\) is the square root of the variance, putting it back in the same units as \(X\) itself.

The key intuition: \(\sigma\) measures spread, not location. Two distributions can have the same mean but look completely different.

| Distribution | Mean \(\mu\) | Std dev \(\sigma\) | What it looks like |
|---|---|---|---|
| A | 50 | 2 | Narrow spike near 50 |
| B | 50 | 15 | Wide, flat spread |

If you’re measuring exam scores and \(\sigma = 2\), almost everyone scored within 4 marks of the average. If \(\sigma = 15\), scores are all over the place. Same average, completely different distributions.

33.2 The binomial distribution

The binomial distribution answers one specific question:

If I repeat an experiment \(n\) times, and each time the probability of success is \(p\), what is the probability of getting exactly \(k\) successes?

The conditions:

  1. Fixed number of trials: \(n\)
  2. Each trial has the same two outcomes: success or failure
  3. The probability of success is the same on every trial: \(p\)
  4. The trials are independent — the result of one doesn’t affect another

If your situation fits all four, the count of successes \(X\) follows a binomial distribution, written \(X \sim B(n, p)\).

33.2.1 Deriving the formula

Suppose \(n = 3\) and you want exactly \(k = 2\) successes (call them S) and 1 failure (F). One specific sequence of outcomes that gives this is: S, S, F.

The probability of that exact sequence is \(p \cdot p \cdot (1-p) = p^2(1-p)\).

But that’s only one arrangement. You could also have: S, F, S or F, S, S. There are 3 arrangements that give exactly 2 successes in 3 trials.

So the total probability is \(3 \times p^2(1-p)\).

The number 3 came from counting arrangements. In general, the number of ways to arrange \(k\) successes in \(n\) trials is written \(\binom{n}{k}\) (read “n choose k”) and calculated as:

\[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]

where \(n! = n \times (n-1) \times \cdots \times 2 \times 1\) is “n factorial” — the number of ways to order \(n\) distinct items.

Calculating \(\binom{n}{k}\)

\(\binom{5}{2} = \frac{5!}{2! \cdot 3!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1)(3 \times 2 \times 1)} = \frac{120}{2 \times 6} = \frac{120}{12} = 10\)

A shortcut: \(\binom{n}{k} = \frac{n \times (n-1) \times \cdots \times (n-k+1)}{k!}\) — multiply \(k\) descending numbers from \(n\), divide by \(k!\).

For \(\binom{5}{2}\): \(\frac{5 \times 4}{2 \times 1} = \frac{20}{2} = 10\). Same answer, less work.
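Both routes to \(\binom{n}{k}\) can be sketched in Python. The helper `choose` below implements the shortcut, and the standard library’s `math.comb` confirms the answer:

```python
from math import comb, factorial, prod

def choose(n, k):
    # Shortcut: multiply k descending numbers from n, then divide by k!.
    return prod(range(n - k + 1, n + 1)) // factorial(k)

print(choose(5, 2))  # 10
print(comb(5, 2))    # 10, the built-in agrees
```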

Putting it together:

\[\boxed{P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}}\]

Three parts:

  • \(\binom{n}{k}\): the number of arrangements of \(k\) successes in \(n\) trials
  • \(p^k\): the probability that each of those \(k\) trials is a success
  • \((1-p)^{n-k}\): the probability that each of the remaining trials is a failure

Multiply the parts: count of arrangements × probability of each arrangement.
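The boxed formula translates directly into code. A minimal sketch (the helper name `binom_pmf` is ours) answers the chapter’s opening question about 10 coin flips:

```python
from math import comb

def binom_pmf(k, n, p):
    # Count of arrangements x probability of each arrangement.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 10 fair coin flips: X ~ B(10, 0.5).
p5 = binom_pmf(5, 10, 0.5)   # exactly 5 heads
p8 = binom_pmf(8, 10, 0.5)   # exactly 8 heads
print(round(p5, 4))  # 0.2461
print(round(p8, 4))  # 0.0439
```

So 5 heads happens only about a quarter of the time, and 8 heads roughly once every 23 runs: uncommon, but not grounds for suspicion on its own.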

33.2.2 Mean and variance of the binomial

You could compute \(\mu = \sum k \cdot P(X = k)\) directly, but the algebra is tedious. The results come out cleanly:

\[\mu = np \qquad \sigma^2 = np(1-p) \qquad \sigma = \sqrt{np(1-p)}\]

These make intuitive sense. If you flip a fair coin 100 times (\(n=100\), \(p=0.5\)), you’d expect 50 heads on average: \(\mu = 100 \times 0.5 = 50\). The standard deviation is \(\sqrt{100 \times 0.5 \times 0.5} = \sqrt{25} = 5\), meaning a “typical” result is within about 5 heads of 50.
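The shortcut formulas can be checked against the tedious direct sums from the definitions. A quick sketch, reusing a hand-rolled `binom_pmf`:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.5

# Direct definitions of mean and variance...
mu = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
var = sum((k - mu) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

# ...agree with the shortcuts np = 50 and np(1-p) = 25.
print(round(mu, 6))   # 50.0
print(round(var, 6))  # 25.0
```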

33.3 The normal distribution

The binomial counts discrete things — 0 heads, 1 head, 2 heads. But heights, weights, temperatures, and exam scores can take any value in a continuous range. For these, you need a different model.

The normal distribution \(N(\mu, \sigma^2)\) is a continuous bell curve centred at \(\mu\) and spreading out by \(\sigma\) in each direction. Note that the second parameter is the variance \(\sigma^2\), not the standard deviation; to find the spread in the original units, take the square root: \(\sigma = \sqrt{\sigma^2}\). You specify the distribution by two numbers only: the mean and the variance.

33.3.1 Why the bell curve appears everywhere

This is one of the most remarkable results in all of probability: when you add up many independent random influences, the total tends toward the normal distribution — regardless of what the individual influences look like. A person’s height is the sum of contributions from hundreds of genes, years of nutrition, sleep patterns, and dozens of other factors. None of those individually looks like a bell curve. Their sum does.

This is the central limit theorem. You won’t prove it at this stage, but you’ll use it, and you’ll see it confirmed every time you look at data from a natural process.
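You can watch the theorem at work with a simulation. The sketch below (our own illustration, with a fixed seed for reproducibility) sums 48 uniform draws per observation. No single draw is bell-shaped, yet the sums land on the mean and spread that theory predicts:

```python
import random
import statistics

random.seed(1)  # fixed seed so the numbers are reproducible

# Each observation is the SUM of 48 independent uniform(0, 1) draws.
# A single draw is flat, not bell-shaped; the sums are normal-ish.
sums = [sum(random.random() for _ in range(48)) for _ in range(10_000)]

# Theory: mean = 48 * 1/2 = 24, variance = 48 * 1/12 = 4, so sigma = 2.
print(round(statistics.mean(sums), 1))
print(round(statistics.stdev(sums), 1))
```

Plotting a histogram of `sums` shows the bell shape directly.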

33.3.2 The 68-95-99.7 rule

For any normal distribution \(N(\mu, \sigma^2)\):

| Range | Probability |
|---|---|
| \(\mu - \sigma\) to \(\mu + \sigma\) (within 1 std dev) | 68% |
| \(\mu - 2\sigma\) to \(\mu + 2\sigma\) (within 2 std dev) | 95% |
| \(\mu - 3\sigma\) to \(\mu + 3\sigma\) (within 3 std dev) | 99.7% |

This is sometimes called the empirical rule. It gives you a quick sense of what is “normal” variation and what is surprising.

If exam scores are \(N(65, 144)\) — so \(\mu = 65\) and \(\sigma = 12\) — then about 95% of students score between \(65 - 24 = 41\) and \(65 + 24 = 89\). A score of 20 would be genuinely extraordinary (more than 3 standard deviations below the mean), occurring less than 0.15% of the time.
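Python’s standard library can confirm the rule for this exam example. One caution: `statistics.NormalDist` takes the standard deviation, not the variance, as its second argument.

```python
from statistics import NormalDist

# The exam example: N(65, 144), so sigma = 12.
exam = NormalDist(mu=65, sigma=12)

# Probability of landing within k standard deviations of the mean.
within = {k: exam.cdf(65 + k * 12) - exam.cdf(65 - k * 12) for k in (1, 2, 3)}
for k in (1, 2, 3):
    print(k, round(within[k], 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```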

33.3.3 Standardising: the z-score

A practical problem: there is a different normal distribution for every combination of \(\mu\) and \(\sigma\). You can’t carry a different table for each one.

The solution is to convert any normal distribution to a single standard one. The standard normal distribution is \(N(0, 1)\) — mean 0, variance 1.

To convert a value \(x\) from any normal distribution to its standard normal equivalent, compute:

\[z = \frac{x - \mu}{\sigma}\]

The value \(z\) is called a z-score. It tells you how many standard deviations above (positive) or below (negative) the mean the value \(x\) sits.

Once you have \(z\), you look up \(P(Z \leq z)\) in a standard normal table or compute it on a calculator. This single table handles all normal distributions.

Example. Scores are \(N(65, 144)\), so \(\mu = 65\), \(\sigma = 12\). For a score of \(x = 83\):

\[z = \frac{83 - 65}{12} = \frac{18}{12} = 1.5\]

A score of 83 sits 1.5 standard deviations above the mean. A table gives \(P(Z \leq 1.5) \approx 0.933\), so about 93.3% of students scored 83 or below — and about 6.7% scored above 83.
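The same lookup without a table, as a short sketch:

```python
from statistics import NormalDist

mu, sigma = 65, 12
x = 83

z = (x - mu) / sigma          # how many standard deviations above the mean
below = NormalDist().cdf(z)   # NormalDist() is the standard normal N(0, 1)

print(z)                # 1.5
print(round(below, 4))  # 0.9332
```

`NormalDist(mu=65, sigma=12).cdf(83)` gives the same answer without standardising by hand; the z-score route matters because one table, or one function, covers every normal distribution.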

33.4 Worked examples

33.4.1 Example 1 — Binomial exact (quality control)

A manufacturing line produces components. Historical data shows that 15% of components are defective. A quality inspector takes a batch of 8 components. What is the probability that exactly 2 are defective?

Here \(X\) = number of defectives, \(n = 8\), \(p = 0.15\), and we want \(P(X = 2)\).

\[P(X = 2) = \binom{8}{2}(0.15)^2(0.85)^6\]

Calculate each part:

\[\binom{8}{2} = \frac{8 \times 7}{2 \times 1} = 28\]

\[(0.15)^2 = 0.0225\]

\[(0.85)^6 = 0.85 \times 0.85 \times 0.85 \times 0.85 \times 0.85 \times 0.85 \approx 0.3771\]

\[P(X = 2) = 28 \times 0.0225 \times 0.3771 \approx 28 \times 0.008485 \approx 0.2376\]

About 24% of batches of 8 will contain exactly 2 defectives.

Check: the mean number of defectives is \(\mu = np = 8 \times 0.15 = 1.2\), so 2 is slightly above average — a 24% probability is plausible.
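A two-line check of the arithmetic, as a sketch:

```python
from math import comb

n, p = 8, 0.15
prob = comb(n, 2) * p**2 * (1 - p)**6   # P(X = 2) for X ~ B(8, 0.15)
print(round(prob, 4))  # 0.2376
```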

33.4.2 Example 2 — Binomial complement (healthcare)

A new drug cures 70% of patients. In a clinical trial, 10 patients receive the drug. What is the probability that at least 8 are cured?

\(X \sim B(10, 0.7)\). We want \(P(X \geq 8) = P(X=8) + P(X=9) + P(X=10)\).

With only 3 terms, direct calculation is manageable:

\[P(X=8) = \binom{10}{8}(0.7)^8(0.3)^2 = 45 \times 0.05765 \times 0.09 \approx 0.2335\]

\[P(X=9) = \binom{10}{9}(0.7)^9(0.3)^1 = 10 \times 0.04035 \times 0.3 \approx 0.1211\]

\[P(X=10) = \binom{10}{10}(0.7)^{10}(0.3)^0 = 1 \times 0.02825 \times 1 \approx 0.0282\]

\[P(X \geq 8) \approx 0.2335 + 0.1211 + 0.0282 = 0.3828\]

There is about a 38% probability that at least 8 of 10 patients are cured. Note: you would use the complement approach when the “at least” threshold is low — e.g., \(P(X \geq 2)\) is much easier to compute as \(1 - P(X=0) - P(X=1)\) than to sum 9 terms directly.
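Both routes from this example can be sketched in a few lines (the helper name `binom_pmf` is ours): the direct three-term sum, and the complement trick for a low “at least” threshold.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Direct sum: only three terms for "at least 8 of 10".
at_least_8 = sum(binom_pmf(k, 10, 0.7) for k in (8, 9, 10))
print(round(at_least_8, 4))  # 0.3828

# Complement route: "at least 2" would need nine terms directly,
# but only two via 1 - P(X=0) - P(X=1).
at_least_2 = 1 - binom_pmf(0, 10, 0.7) - binom_pmf(1, 10, 0.7)
print(round(at_least_2, 4))  # 0.9999
```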

33.4.3 Example 3 — Normal z-score (education)

End-of-year exam scores in a large district are approximately normally distributed with mean \(\mu = 65\) and standard deviation \(\sigma = 12\). What fraction of students scored above 80?

We want \(P(X > 80)\).

Step 1: Standardise.

\[z = \frac{80 - 65}{12} = \frac{15}{12} = 1.25\]

Step 2: Look up \(P(Z \leq 1.25)\).

From a standard normal table (or calculator): \(P(Z \leq 1.25) \approx 0.8944\).

Step 3: Convert to the required probability.

\[P(X > 80) = 1 - P(X \leq 80) = 1 - 0.8944 = 0.1056\]

About 10.6% of students scored above 80.

A quick check with the 68-95-99.7 rule: 80 is \(\frac{15}{12} = 1.25\) standard deviations above the mean, which is between 1\(\sigma\) and 2\(\sigma\). The rule says about 16% of values lie beyond 1\(\sigma\), so 10.6% beyond 1.25\(\sigma\) is consistent.

33.4.4 Example 4 — Normal reverse lookup (pass with distinction)

Using the same exam distribution (\(\mu = 65\), \(\sigma = 12\)), the school awards “distinction” to the top 10% of students. What score is the distinction threshold?

We want the value \(x\) such that \(P(X > x) = 0.10\), i.e., \(P(X \leq x) = 0.90\).

Step 1: Find the z-score for the 90th percentile.

From a standard normal table, find \(z\) such that \(P(Z \leq z) = 0.90\).

Looking up 0.90: \(z \approx 1.282\).

Step 2: Unstandardise — convert back to the original scale.

Since \(z = \frac{x - \mu}{\sigma}\), rearranging gives:

\[x = \mu + z\sigma = 65 + 1.282 \times 12 = 65 + 15.38 \approx 80.4\]

A score of about 80 is the distinction threshold.

Note the structure: for a forward problem you go \(x \to z \to\) probability. For a reverse problem you go probability \(\to z \to x\). The formula \(x = \mu + z\sigma\) is just the standardisation formula solved for \(x\).
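The reverse problem in code uses the inverse CDF, which `statistics.NormalDist` provides as `inv_cdf`:

```python
from statistics import NormalDist

mu, sigma = 65, 12

z = NormalDist().inv_cdf(0.90)   # z-score of the 90th percentile
x = mu + z * sigma               # unstandardise back to exam marks

print(round(z, 3))  # 1.282
print(round(x, 1))  # 80.4
```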

33.5 Where this goes

Statistical inference (Chapter 3) is the direct next step. A distribution gives you the expected shape of data from a random process. Inference asks the reverse question: your actual data has a shape — is that consistent with what you’d expect from the assumed distribution, or is something else going on? Every hypothesis test you’ll ever run is answering that question.

Volume 7 probability builds the mathematical foundations underneath what you’ve been using here. Why does the normal distribution have exactly the bell-curve formula it does? What does “continuous probability” mean rigorously? The answers require calculus — integration gives you the area under the bell curve — but you can use the results now.

33.6 Applications

Where these distributions appear in practice

A/B testing. A website tests two versions of a button. 200 users see version A, 200 see version B. Is the difference in click rates real or just chance? Binomial distributions tell you what click-rate differences to expect from chance alone.

Quality control charts. Manufacturing uses Shewhart control charts: plot each batch measurement, mark lines at \(\mu \pm 2\sigma\) and \(\mu \pm 3\sigma\). A point outside the \(3\sigma\) line has less than a 0.3% chance of occurring naturally — strong evidence something has gone wrong in the process.

Weather forecasting. A “70% chance of rain” is a probability from a distribution over possible outcomes. Forecast skill — whether forecasts are actually calibrated — is assessed using statistics built on the normal distribution.

Clinical trial power. Before running a drug trial, statisticians calculate how many patients they need to have a reasonable chance of detecting a real effect if one exists. That calculation requires knowing the distribution of outcomes under both the null hypothesis and the alternative. You’re one chapter away from doing this.

33.7 Exercises

Exercise 1. A student guesses randomly on a 12-question true/false quiz. Let \(X\) be the number of correct answers.

  1. State the distribution of \(X\), giving \(n\) and \(p\).
  2. Calculate \(P(X = 6)\).
  3. Calculate the mean and standard deviation of \(X\).

Exercise 2. In a city, 20% of households own an electric vehicle. A researcher surveys 15 randomly chosen households.

  1. Let \(X\) be the number of households with an electric vehicle. State the distribution of \(X\).
  2. Find \(P(X = 3)\).
  3. Find \(P(X \leq 1)\) using the complement or direct calculation.
  4. Find the mean and variance of \(X\).

Exercise 3. A coin is known to be biased: it shows heads with probability \(p = 0.4\). The coin is flipped 20 times.

  1. Write down the mean and variance of the number of heads.
  2. Without calculating individual probabilities, explain whether you would be surprised to get 14 heads. (Use \(\sigma\) to guide your reasoning.)

Exercise 4. The resting heart rate of adults in a large study is approximately normally distributed with mean \(\mu = 72\) beats per minute and standard deviation \(\sigma = 10\) bpm.

  1. Find the probability that a randomly chosen adult has a resting heart rate above 90 bpm.
  2. Find the probability that a randomly chosen adult has a resting heart rate between 62 and 82 bpm. (You may use the 68-95-99.7 rule for this part.)
  3. What percentage of adults have resting heart rates below 55 bpm?

Exercise 5. Graduate entry to a university programme requires a score in the top 5% of a standardised test. Scores are normally distributed with \(\mu = 500\) and \(\sigma = 100\).

  1. Find the z-score corresponding to the 95th percentile. (Use \(z_{0.95} \approx 1.645\).)
  2. What is the minimum score needed for entry?
  3. A student scores 620. What percentile does this correspond to? (Find \(z\), then use a table or calculator.)

Exercise 6. A marine biologist is studying kelp-eating urchins on a reef. She records the number of urchins found in each of 50 one-metre quadrats. Historical data for this reef type suggests urchin counts per quadrat follow a distribution with \(\mu = 8\) and \(\sigma = 2.8\).

  1. A count of 15 urchins is recorded in one quadrat. How many standard deviations above the mean is this?
  2. Using the 68-95-99.7 rule, roughly what fraction of quadrats would you expect to contain more than 13.6 urchins?
  3. The biologist suspects a disturbance has shifted urchin density. She finds 12 quadrats (out of 50) with counts above 13.6. Based on your answer to (b), does this seem consistent with the historical distribution? Explain briefly.