34 Statistical Inference
From sample to conclusion
A study makes the news: a new energy drink improves reaction time by 12 milliseconds on average in a sample of 40 university students. The company behind it claims the drink genuinely works. A sceptic says the 12 ms gap could just be luck.
Both people are looking at the same number. How can they reach opposite conclusions?
The answer is that a single number from a single sample cannot settle the question on its own. What matters is what numbers like that look like when the drink does nothing — when the only thing producing variation is chance. If 12 ms is the kind of gap you’d often see from chance alone, the result is unremarkable. If 12 ms is extremely unusual under chance, something else is probably going on.
This chapter is about how to make that reasoning precise. The tools are called statistical inference — the machinery for drawing conclusions about populations from samples, while being honest about uncertainty.
34.1 Populations and samples
A population is the complete set of individuals or measurements you care about. A sample is the subset you can actually observe.
In most real situations, measuring the entire population is impossible. You cannot test every batch of tablets that will ever leave a factory. You cannot ask every voter in an election what they think today. You cannot measure the resting heart rate of every adult in the country. So you take a sample — as many observations as you can afford — and reason from there.
The fundamental challenge is that the sample will not perfectly reflect the population. Take a different sample and you’ll get a different mean. Take enough different samples and the sample means form their own distribution — one that carries useful information about the population.
A few terms used throughout:
- \(N\): the population size — the total number of individuals in the population
- \(n\): the sample size — how many you actually measured
- \(\mu\) (mu): the population mean — what you want to know but usually cannot compute directly
- \(\bar{x}\) (x-bar): the sample mean — what you computed from your \(n\) observations
- \(\sigma\): the population standard deviation — spread of the whole population
- \(s\): the sample standard deviation — spread estimated from your sample
The goal of inference is to say something credible about \(\mu\) using \(\bar{x}\), knowing that \(\bar{x}\) is a good guess but not an exact answer.
34.1.1 What makes a sample useful
Not every sample tells you much. A sample of convenience — asking only your friends, measuring only the easiest-to-reach parts of a forest, testing only patients who volunteer — may be biased in ways that invalidate any conclusion about the broader population.
A representative sample is one where each member of the population has a fair chance of being included. The cleanest version is a simple random sample: every member of the population is equally likely to be chosen, and choices are independent. In practice, this is an ideal rather than a guarantee, but it is the assumption that makes everything in this chapter valid. Whenever you see a claim built on inference, ask: was the sample actually representative?
34.2 The sampling distribution
Suppose heights in a population are normally distributed with mean \(\mu = 170\) cm and standard deviation \(\sigma = 8\) cm. You take a random sample of \(n = 25\) people and compute their mean height \(\bar{x}\).
Now imagine doing this many times: take 25 people, compute \(\bar{x}\), record it. Take another 25, compute \(\bar{x}\), record it. Repeat this thousands of times.
What would you see if you plotted all those sample means?
Three things, and all three are predictable:
The sample means cluster around \(\mu = 170\). No surprise — if your sampling is unbiased, the average of the averages should track the true population mean.
The sample means are less spread out than individual heights. Averaging smooths out extremes. A single person might be 155 cm or 185 cm — that’s plausible. But a mean of 25 people being 155 cm would require nearly everyone in the sample to be very short, which is far less likely.
The distribution of sample means is approximately normal — even if the population itself is not perfectly normal.
This distribution — the distribution of \(\bar{x}\) across all possible samples of size \(n\) — is called the sampling distribution of the mean.
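The repeated-sampling thought experiment can be run directly. Here is a minimal Python sketch (illustrative, not part of the original example beyond reusing its parameters): draw many samples of \(n = 25\) from a \(N(170, 8^2)\) height population and record each sample mean.

```python
import random
import statistics

# Illustrative simulation: repeatedly sample n = 25 heights from
# N(170, 8) and record each sample mean.
random.seed(1)
mu, sigma, n = 170, 8, 25

sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(10_000)
]

# The means cluster around mu, with much less spread than individuals.
print(round(statistics.fmean(sample_means), 1))   # near 170
print(round(statistics.stdev(sample_means), 2))   # near 8 / sqrt(25) = 1.6
```

Plotting `sample_means` as a histogram would show the third observation: an approximately normal bell shape.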
34.3 The Central Limit Theorem
The three observations above are not separate facts. They are consequences of a single result called the Central Limit Theorem (CLT).
The Central Limit Theorem
If you take random samples of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\), then for sufficiently large \(n\), the sampling distribution of \(\bar{x}\) is approximately normal:
\[\bar{x} \sim N\!\left(\mu,\, \frac{\sigma^2}{n}\right)\]
The mean of the sampling distribution equals \(\mu\). The standard deviation of the sampling distribution is \(\dfrac{\sigma}{\sqrt{n}}\).
The standard deviation of the sampling distribution has its own name: the standard error (SE).
\[SE = \frac{\sigma}{\sqrt{n}}\]
The standard error is not the spread of individual measurements — it is the spread of sample means. It tells you how much \(\bar{x}\) typically varies from one sample to the next.
34.3.1 Why this is surprising
The CLT makes no assumption about the shape of the population distribution. Whether the individual measurements are skewed, uniform, bimodal, or anything else, the sample means pile up into a bell curve once \(n\) is large enough.
This is the reason the normal distribution appears so often in statistical work. It is not because every quantity in nature is normally distributed. It is because sample means tend to be normally distributed, and most of the quantities we measure and compare are averages of some kind.
As a rough guide, \(n \geq 30\) is often enough for the CLT to give a good approximation. For symmetric populations, even smaller samples work well. For heavily skewed populations, larger \(n\) may be needed.
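The shape-independence claim can be checked by simulation. This sketch deliberately uses a heavily skewed population — exponential with mean 1 and standard deviation 1, chosen purely as an illustration — yet the sample means still land near \(\mu\) with spread near \(\sigma/\sqrt{n}\):

```python
import random
import statistics

# CLT sketch with a skewed population: exponential(rate=1) has
# mean 1 and standard deviation 1, and is strongly right-skewed.
random.seed(2)
n = 50
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(5_000)
]

print(round(statistics.fmean(means), 2))   # near mu = 1
print(round(statistics.stdev(means), 3))   # near 1 / sqrt(50) ≈ 0.141
```

A histogram of `means` would look close to a bell curve despite the skewed population underneath.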
34.3.2 Working with the standard error — a numerical example
Return to the height example: \(\mu = 170\) cm, \(\sigma = 8\) cm.
For samples of size \(n = 25\):
\[SE = \frac{8}{\sqrt{25}} = \frac{8}{5} = 1.6 \text{ cm}\]
The sampling distribution is \(\bar{x} \sim N(170, 1.6^2)\). A sample mean of 173 cm is:
\[z = \frac{173 - 170}{1.6} = \frac{3}{1.6} = 1.875 \text{ standard errors above the mean}\]
For samples of size \(n = 100\):
\[SE = \frac{8}{\sqrt{100}} = \frac{8}{10} = 0.8 \text{ cm}\]
The same sample mean of 173 cm is now:
\[z = \frac{173 - 170}{0.8} = \frac{3}{0.8} = 3.75 \text{ standard errors above the mean}\]
With \(n = 25\), a sample mean of 173 cm is not very surprising — it is less than 2 standard errors out. With \(n = 100\), the same 173 cm is very surprising — nearly 4 standard errors out. Larger samples give you more information, so unusual results become more informative.
This is the key insight: more data makes your estimate more precise, and the improvement scales as \(1/\sqrt{n}\), not \(1/n\). To halve the standard error, you need four times as much data.
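The \(1/\sqrt{n}\) scaling is easy to verify numerically. A short sketch using the height example's \(\sigma = 8\) cm:

```python
import math

# Standard error for increasing sample sizes, sigma = 8 cm.
sigma = 8
for n in (25, 100, 400):
    se = sigma / math.sqrt(n)
    print(n, se)   # 1.6, then 0.8, then 0.4: quadrupling n halves the SE
```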
34.4 Confidence intervals
You take a sample, compute \(\bar{x}\), and want to say something about \(\mu\). A single number — “\(\mu\) is approximately 170 cm” — is called a point estimate. It is useful but incomplete, because it gives no indication of how uncertain you are.
An interval estimate gives a range of plausible values for \(\mu\). A confidence interval attaches a stated level of confidence to that range — typically 90%, 95%, or 99%.
34.4.1 Building a 95% confidence interval
From Chapter 2, you know that for a normal distribution, 95% of values fall within approximately 1.96 standard deviations of the mean. Precisely:
\[P(-1.96 \leq Z \leq 1.96) = 0.95\]
where \(Z \sim N(0,1)\).
Since \(\bar{x} \sim N(\mu, \sigma^2/n)\), standardising gives:
\[P\!\left(-1.96 \leq \frac{\bar{x} - \mu}{SE} \leq 1.96\right) = 0.95\]
Rearranging the inequality to isolate \(\mu\): multiply through by \(SE\) to get \(-1.96 \cdot SE \leq \bar{x} - \mu \leq 1.96 \cdot SE\), subtract \(\bar{x}\) throughout, then multiply by \(-1\) (which reverses both inequalities) to obtain \(\bar{x} - 1.96 \cdot SE \leq \mu \leq \bar{x} + 1.96 \cdot SE\).
\[P\!\left(\bar{x} - 1.96 \cdot SE \leq \mu \leq \bar{x} + 1.96 \cdot SE\right) = 0.95\]
This is the 95% confidence interval for \(\mu\) (when \(\sigma\) is known):
\[\boxed{\bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}}\]
The value 1.96 is called the critical value for 95% confidence. For other confidence levels, the critical value changes:
| Confidence level | Critical value \(z^*\) |
|---|---|
| 90% | 1.645 |
| 95% | 1.960 |
| 99% | 2.576 |
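These critical values come from the inverse CDF of the standard normal: for a two-sided confidence level \(C\), the critical value is the \((1+C)/2\) quantile. Python's standard library can reproduce the table (a sketch, using `statistics.NormalDist`):

```python
from statistics import NormalDist

# Critical values z* for two-sided confidence levels: for level C,
# z* is the (1 + C) / 2 quantile of the standard normal.
z = NormalDist()  # standard normal N(0, 1)
for level in (0.90, 0.95, 0.99):
    z_star = z.inv_cdf((1 + level) / 2)
    print(f"{level:.0%}: {z_star:.3f}")   # 1.645, 1.960, 2.576
```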
34.4.2 Worked example — confidence interval
A nutritionist measures the daily sodium intake (mg) of a random sample of \(n = 36\) adults from a region. The sample mean is \(\bar{x} = 2{,}450\) mg. Population studies suggest \(\sigma = 360\) mg for daily sodium intake.
Construct a 95% confidence interval for the true mean daily sodium intake \(\mu\) in this region.
Step 1: Calculate the standard error.
\[SE = \frac{\sigma}{\sqrt{n}} = \frac{360}{\sqrt{36}} = \frac{360}{6} = 60 \text{ mg}\]
Step 2: Find the margin of error.
\[\text{Margin of error} = 1.96 \times SE = 1.96 \times 60 = 117.6 \text{ mg}\]
Step 3: Construct the interval.
\[\bar{x} \pm \text{margin of error} = 2{,}450 \pm 117.6\]
\[\text{CI: } (2{,}332.4 \text{ mg},\ 2{,}567.6 \text{ mg})\]
Step 4: Interpret.
We are 95% confident that the true mean daily sodium intake in this region is between 2,332 mg and 2,568 mg.
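The same calculation as code, with the values taken directly from the worked example:

```python
import math

# Sodium example: n = 36 adults, x-bar = 2450 mg, sigma = 360 mg.
x_bar, sigma, n, z_star = 2450, 360, 36, 1.96

se = sigma / math.sqrt(n)            # 60.0 mg
margin = z_star * se                 # 117.6 mg
lower, upper = x_bar - margin, x_bar + margin
print(f"95% CI: ({lower:.1f} mg, {upper:.1f} mg)")
```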
34.4.3 What “95% confident” actually means
This is one of the most commonly misunderstood ideas in statistics.
The incorrect interpretation: “There is a 95% probability that \(\mu\) lies in this interval.”
This sounds reasonable but is wrong. The population mean \(\mu\) is a fixed (if unknown) number — it does not have a probability distribution. It either is or is not in the interval. The interval itself is random, because \(\bar{x}\) varies from sample to sample.
The correct interpretation: If you were to repeat this sampling procedure many times — each time collecting 36 adults, computing \(\bar{x}\), and constructing the interval — then 95% of the resulting intervals would contain the true \(\mu\).
In other words, confidence describes the procedure, not any single interval. The 95% refers to the long-run success rate of the method.
In practice, you have one interval from one sample, and you cannot know whether yours is one of the 95% that work or one of the 5% that don’t. The confidence level tells you the method is reliable, not that any particular result is guaranteed.
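The long-run interpretation can be demonstrated by simulation. This sketch (reusing the sodium example's numbers purely for illustration) builds many 95% intervals from fresh samples and counts how often they cover the true \(\mu\):

```python
import math
import random
import statistics

# Coverage simulation: the "95%" describes the procedure's long-run
# success rate, not any single interval.
random.seed(3)
mu, sigma, n, z_star = 2450, 360, 36, 1.96
se = sigma / math.sqrt(n)

covered = 0
trials = 2_000
for _ in range(trials):
    x_bar = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    if x_bar - z_star * se <= mu <= x_bar + z_star * se:
        covered += 1

print(covered / trials)   # close to 0.95
```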
A useful sentence template for interpreting confidence intervals
“We are [level]% confident that the true population mean [quantity] is between [lower] and [upper] [units].”
Avoid: “There is a 95% chance the mean is in this interval.” Use: “We are 95% confident the mean is in this interval.”
34.5 Hypothesis testing
A confidence interval tells you what values of \(\mu\) are plausible given your data. A hypothesis test answers a more direct question: is the evidence strong enough to reject a specific claim about \(\mu\)?
34.5.1 The logic of the test
The framework borrows from legal reasoning. A court presumes innocence until the evidence makes guilt sufficiently clear. In statistics:
- You start by assuming a specific null state — the null hypothesis \(H_0\).
- You ask: if the null hypothesis were true, how likely is it that you’d see data at least as extreme as what you observed?
- If the answer is “very unlikely,” the data is evidence against \(H_0\).
What counts as “very unlikely” is a threshold you set in advance, called the significance level \(\alpha\). The most common choice is \(\alpha = 0.05\).
34.5.2 Setting up the hypotheses
Every test involves two competing hypotheses:
The null hypothesis \(H_0\) is the claim being tested. It usually represents “no effect,” “no difference,” or a specific benchmark value. You write it as an equality: \(H_0: \mu = \mu_0\), where \(\mu_0\) is the claimed value.
The alternative hypothesis \(H_1\) (sometimes written \(H_a\)) is what you believe might be true instead. It can be:
- Two-tailed: \(H_1: \mu \neq \mu_0\) — the mean could be either higher or lower than claimed
- One-tailed (upper): \(H_1: \mu > \mu_0\) — you suspect the mean is higher
- One-tailed (lower): \(H_1: \mu < \mu_0\) — you suspect the mean is lower
The choice of one- or two-tailed test should reflect your research question before you look at the data. Choosing the direction after seeing the data inflates the apparent significance and undermines the logic of the test.
34.5.3 The test statistic
If \(H_0: \mu = \mu_0\) is true, then the sampling distribution of \(\bar{x}\) is:
\[\bar{x} \sim N\!\left(\mu_0,\, \frac{\sigma^2}{n}\right)\]
Standardising gives the z-test statistic:
\[z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]
This \(z\) value measures how many standard errors your observed sample mean is from the null hypothesis value. If \(|z|\) is small, your data is consistent with \(H_0\). If \(|z|\) is large, your data is unusual under \(H_0\).
34.5.4 The p-value
The p-value is the probability of observing a test statistic at least as extreme as the one you computed, if \(H_0\) is true.
- For a two-tailed test (\(H_1: \mu \neq \mu_0\)): \(p\text{-value} = 2 \times P(Z \geq |z|)\)
- For an upper-tailed test (\(H_1: \mu > \mu_0\)): \(p\text{-value} = P(Z \geq z)\)
- For a lower-tailed test (\(H_1: \mu < \mu_0\)): \(p\text{-value} = P(Z \leq z)\)
A small p-value means your observed data would be rare under \(H_0\) — the data is hard to explain by chance alone.
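The three formulas translate directly into code. A sketch using `statistics.NormalDist` (the helper function `p_value` is introduced here for illustration, not from the text):

```python
from statistics import NormalDist

# p-values for the three forms of the alternative hypothesis.
Z = NormalDist()  # standard normal N(0, 1)

def p_value(z: float, tail: str) -> float:
    if tail == "two":      # H1: mu != mu0
        return 2 * (1 - Z.cdf(abs(z)))
    if tail == "upper":    # H1: mu > mu0
        return 1 - Z.cdf(z)
    if tail == "lower":    # H1: mu < mu0
        return Z.cdf(z)
    raise ValueError(f"unknown tail: {tail}")

print(round(p_value(1.96, "two"), 3))    # ≈ 0.05, the familiar boundary
print(round(p_value(1.74, "upper"), 4))  # ≈ 0.041
```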
34.5.5 The decision rule
Compare the p-value to the significance level \(\alpha\):
- If \(p\text{-value} \leq \alpha\): reject \(H_0\). The data is sufficiently unusual under \(H_0\) to conclude there is evidence for \(H_1\).
- If \(p\text{-value} > \alpha\): fail to reject \(H_0\). The data is consistent with \(H_0\) — not proof that \(H_0\) is true, only that there is insufficient evidence to reject it.
Note the careful language: you fail to reject, not accept \(H_0\). Absence of evidence is not evidence of absence.
34.6 Worked example — full hypothesis test
The school cafeteria claims that a standard portion of their pasta contains 650 mg of sodium. A nutritionist suspects the actual amount is higher. She measures 40 randomly chosen portions and finds a sample mean of \(\bar{x} = 672\) mg. From manufacturer data, \(\sigma = 80\) mg.
Test the nutritionist’s claim at the 5% significance level.
Step 1: State the hypotheses.
The cafeteria claims \(\mu = 650\) mg. The nutritionist suspects it is higher, so this is a one-tailed (upper) test.
\[H_0: \mu = 650 \quad H_1: \mu > 650\]
Step 2: Choose the significance level.
\[\alpha = 0.05\]
Step 3: Compute the test statistic.
\[z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{672 - 650}{80 / \sqrt{40}} = \frac{22}{80/6.325} = \frac{22}{12.65} \approx 1.74\]
Step 4: Find the p-value.
For an upper-tailed test:
\[p\text{-value} = P(Z \geq 1.74)\]
From a standard normal table: \(P(Z \leq 1.74) \approx 0.9591\).
\[p\text{-value} = 1 - 0.9591 = 0.0409\]
Step 5: Make a decision.
\(p\text{-value} = 0.041 < \alpha = 0.05\), so we reject \(H_0\).
Step 6: State the conclusion in plain language.
At the 5% significance level, there is sufficient evidence to conclude that the mean sodium content of cafeteria pasta portions exceeds the claimed 650 mg.
Notice what this conclusion does not say. It does not say the true mean is 672 mg. It does not say the cafeteria is lying. It says the evidence is strong enough to act on the suspicion that something is off. The strength of that conclusion depends on the assumptions: random sampling, known \(\sigma\), and that the significance level was chosen before looking at the data.
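The whole test collapses into a few lines of code. A sketch with the values from the worked example:

```python
import math
from statistics import NormalDist

# Cafeteria pasta test: H0: mu = 650, H1: mu > 650 (upper-tailed).
mu0, x_bar, sigma, n, alpha = 650, 672, 80, 40, 0.05

se = sigma / math.sqrt(n)
z = (x_bar - mu0) / se                 # ≈ 1.74
p = 1 - NormalDist().cdf(z)            # ≈ 0.041

print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p <= alpha else "fail to reject H0")
```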
34.7 Type I and Type II errors
Every hypothesis test makes a binary decision under uncertainty. Two kinds of mistakes are possible:
| | \(H_0\) is actually true | \(H_0\) is actually false |
|---|---|---|
| Reject \(H_0\) | Type I error (false positive) | Correct decision |
| Fail to reject \(H_0\) | Correct decision | Type II error (false negative) |
A Type I error is rejecting \(H_0\) when it is true — concluding there is an effect when there is none. The probability of a Type I error is exactly \(\alpha\), the significance level you chose. Setting \(\alpha = 0.05\) means you accept a 5% chance of a false positive if \(H_0\) is true.
A Type II error is failing to reject \(H_0\) when it is false — missing a real effect. Its probability is denoted \(\beta\). The power of a test is \(1 - \beta\): the probability of correctly detecting a true effect.
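The claim that the Type I error rate equals \(\alpha\) can itself be checked by simulation. This sketch (reusing the cafeteria numbers purely for illustration) simulates a world where \(H_0\) is true and counts how often the test wrongly rejects it:

```python
import math
import random
import statistics
from statistics import NormalDist

# Type I error simulation: when H0 is true, an alpha = 0.05 test
# should reject about 5% of the time.
random.seed(4)
mu0, sigma, n, alpha = 650, 80, 40, 0.05
se = sigma / math.sqrt(n)

false_positives = 0
trials = 2_000
for _ in range(trials):
    # Data generated with the true mean equal to mu0 (H0 holds).
    x_bar = statistics.fmean(random.gauss(mu0, sigma) for _ in range(n))
    p = 1 - NormalDist().cdf((x_bar - mu0) / se)   # upper-tailed test
    if p <= alpha:
        false_positives += 1

print(false_positives / trials)   # close to alpha = 0.05
```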
34.7.1 Why this tradeoff matters
Reducing \(\alpha\) (demanding stronger evidence before rejecting \(H_0\)) decreases the chance of a false positive but increases the chance of a false negative — you become more conservative and more likely to miss real effects.
The stakes differ by context. A Type I error in a drug trial means approving an ineffective drug — real harm. A Type II error means failing to approve an effective one — also real harm, but different in character. Before running a study, statisticians specify both \(\alpha\) and the desired power, then calculate the sample size \(n\) required to achieve both. Larger samples reduce both types of error simultaneously, which is why clinical trials are expensive.
A concrete framing
A fire alarm that goes off every time someone burns toast (Type I error: false positive) is annoying and teaches people to ignore it. A fire alarm that never sounds during an actual fire (Type II error: false negative) is catastrophic. Good alarm design — and good statistical design — requires thinking carefully about which error is more costly.
34.8 Connecting to the bigger picture
Every time you see a published result with a p-value, you are looking at the output of this machinery. The researcher had a null hypothesis, collected data, computed a test statistic, and compared the p-value to a threshold.
When \(\sigma\) is unknown and estimated from the sample (using \(s\) instead), the test statistic no longer follows the standard normal distribution. It follows a t-distribution, which is slightly wider and depends on the sample size. For large \(n\), the difference is small; for small samples, it matters. The t-distribution is the natural next step beyond this chapter.
Two further extensions are essential for most real work:
Comparing two groups. Does treatment A work better than treatment B? Here you test whether the difference in means \(\mu_1 - \mu_2 = 0\) rather than testing a single mean. The two-sample z-test or t-test extends the framework from this chapter directly.
Regression. Once you move to modelling how one variable predicts another, you are still doing inference — testing whether the slope of a line is zero, constructing confidence intervals around predictions. The language of \(H_0\), test statistics, and p-values is identical.
The machinery in this chapter is the foundation for all of it.
34.9 Where this goes
What this chapter enables
- Computing (comp): Machine learning validation — train/test splits, significance of benchmark improvements — is applied inference. Every time a researcher reports that model A beats model B, the question is whether the difference is real or sampling noise.
- Hard sciences (sci): Every published experiment reports a p-value. Clinical trial analysis, drug approval decisions, and environmental monitoring all run on the framework you have just learned.
- Finance and business (biz): A/B testing at scale, quality control charts, and actuarial inference are direct applications. The decision to change a product is a hypothesis test with real stakes.
- Geography and environment (geo): Climate trend detection — distinguishing a genuine warming signal from year-to-year variation — requires formal inference. So does environmental monitoring for pollution events and species population changes.
34.10 Exercises
A note on these exercises
All exercises in this chapter give you \(\sigma\), the population standard deviation. In practice, \(\sigma\) is almost never known — you estimate it from the sample using \(s\). That extension (the t-distribution) is the natural next step beyond this chapter. For now, you will always be given \(\sigma\).
Exercise 1. A population of salmon in a river system has a known standard deviation of \(\sigma = 42\) mm in body length. A researcher samples \(n\) fish from this population.
- Calculate the standard error when \(n = 9\).
- Calculate the standard error when \(n = 36\).
- Calculate the standard error when \(n = 144\).
- By what factor does the standard error change each time \(n\) increases by a factor of 4? Explain why this makes sense using the formula.
Exercise 2. A climatologist records the daily high temperature at a monitoring station over 49 randomly selected days. The sample mean is \(\bar{x} = 18.4\)°C. From long-term records, the population standard deviation is \(\sigma = 5.6\)°C.
- Calculate the standard error for this sample.
- Construct a 95% confidence interval for the true mean daily high temperature at this location.
- Construct a 99% confidence interval. (Use \(z^* = 2.576\).)
- Write one sentence interpreting your 95% CI correctly.
Exercise 3. For each of the following scenarios, state the null and alternative hypotheses. Identify whether the test is one-tailed or two-tailed. If one-tailed, state the direction.
A consumer watchdog tests whether a cereal box labelled “500 g net weight” actually contains 500 g on average. The watchdog is concerned the boxes may contain less than advertised.
A researcher believes a new fertiliser changes mean crop yield. She does not know whether the effect is positive or negative.
A traffic engineer claims that average vehicle speed on a stretch of highway is 95 km/h. A safety audit suspects drivers are exceeding this average.
Exercise 4. An athletics coach claims that the training programme he coaches produces marathon runners with a mean finish time of 210 minutes. A sports scientist believes the true mean is different (in either direction). She times 36 runners through the programme. Their sample mean finish time is \(\bar{x} = 205.2\) minutes. The population standard deviation is known to be \(\sigma = 18\) minutes.
- State \(H_0\) and \(H_1\).
- Calculate the test statistic \(z\).
- Find the p-value. (This is a two-tailed test; use \(P(Z \leq -1.60) \approx 0.055\) as a reference value.)
- At \(\alpha = 0.05\), state your decision and conclusion in plain language.
Exercise 5. A pharmaceutical company is testing a new painkiller. The clinical team must choose a significance level before running the trial.
- Explain what a Type I error would mean in this context.
- Explain what a Type II error would mean in this context.
- The team is debating between \(\alpha = 0.05\) and \(\alpha = 0.01\). Describe the tradeoff. Which error becomes more likely if they choose the stricter threshold? Which becomes less likely?
- The drug is for a condition where the standard treatment has serious side effects. How might this context influence the choice of significance level? There is no single right answer — explain your reasoning.
Exercise 6 goes beyond what this chapter formally covers — you don’t need the two-sample formula, but you do need to reason carefully about what would be required.
Exercise 6. A data scientist is running an A/B test for an e-commerce site. Version A (the existing checkout flow) has a known mean order value of \(\mu_A = \$87\) with \(\sigma_A = \$22\), estimated from thousands of previous transactions. Version B (a redesigned flow) is tested on a new sample.
- A sample of 100 users on Version B produces \(\bar{x}_B = \$93\). If you were to test whether this is significantly different from the Version A mean of $87, what would your null and alternative hypotheses be?
- What additional information would you need to conduct a formal two-sample test comparing Version A and Version B means?
- The data scientist argues: “Version B has a higher sample mean, so we should switch.” A statistician responds: “Not so fast.” Explain the statistician’s concern in terms of sampling variability and the standard error. What would a proper inference procedure provide that the raw comparison of means cannot?