58  Mathematical statistics

Inference from data

A production line fills bottles to a stated volume. A sample of 25 bottles shows a mean fill of 497 mL against a target of 500 mL. Is that a real problem, or just expected random variation? You need a principled way to decide.

A new catalyst cuts reaction time in a lab test. But the test used only eight runs. How confident are you that the effect is real and not noise?

You have data. The question is what the data can actually tell you — and, just as importantly, what it cannot.

58.1 Point estimation

You observe \(n\) independent realisations \(x_1, x_2, \ldots, x_n\) from a distribution that has an unknown parameter \(\theta\) — a proportion, a rate, a mean. A point estimate \(\hat{\theta}\) — read “theta-hat” — is a single number computed from the sample to serve as your best guess at \(\theta\).

The most systematic method for finding point estimates is maximum likelihood estimation (MLE): choose the value of \(\theta\) that makes the observed data most probable.

58.1.1 Setting up the likelihood

If the \(x_i\) are i.i.d. (independent, identically distributed) from a distribution with density (or mass function) \(f(x; \theta)\), the likelihood is the joint probability of the observed data, viewed as a function of \(\theta\):

\[L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)\]

Read this as: the product, over all \(n\) observations, of the density evaluated at each \(x_i\). The MLE is the \(\theta\) that maximises this.

Products are hard to differentiate. Take the natural log — it’s monotone, so the maximum is preserved. The log-likelihood is:

\[\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i;\theta)\]

Read “\(\ell(\theta)\)” as “ell of theta”. Now differentiate and set equal to zero.

58.1.2 MLE for the Bernoulli distribution

Let \(X_i \sim \text{Bernoulli}(p)\): each observation is 1 (success) or 0 (failure), with unknown success probability \(p\). Then \(f(x; p) = p^x (1-p)^{1-x}\), so:

\[\ell(p) = \sum_{i=1}^n \left[ x_i \log p + (1-x_i)\log(1-p) \right] = \left(\sum x_i\right)\log p + \left(n - \sum x_i\right)\log(1-p)\]

Let \(S = \sum x_i\) (the count of successes). Differentiate with respect to \(p\) and set to zero:

\[\frac{d\ell}{dp} = \frac{S}{p} - \frac{n - S}{1-p} = 0\]

Multiply through by \(p(1-p)\):

\[S(1-p) - (n-S)p = 0 \implies S - Sp - np + Sp = 0 \implies S = np\]

\[\boxed{\hat{p} = \frac{S}{n} = \bar{X}}\]

The MLE is the sample proportion — the observed fraction of successes. The systematic machinery recovers exactly the intuitive estimate.
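A quick numerical check, sketched in Python with a made-up sample (12 successes in 20 trials — the data are illustrative): maximise the log-likelihood over a grid and compare with the closed form \(S/n\).

```python
import math

# Made-up Bernoulli sample: 12 successes in 20 trials
x = [1] * 12 + [0] * 8
n, S = len(x), sum(x)

def log_lik(p):
    # ℓ(p) = S log p + (n - S) log(1 - p)
    return S * math.log(p) + (n - S) * math.log(1 - p)

# Grid search over (0, 1); the maximiser should land on S/n = 0.6
grid = [i / 1000 for i in range(1, 1000)]
p_hat_numeric = max(grid, key=log_lik)
p_hat_closed = S / n
print(p_hat_numeric, p_hat_closed)  # → 0.6 0.6
```

The grid search and the calculus-based answer agree, which is the point: the derivative argument simply locates the peak of this curve analytically.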

58.1.3 MLE for the Exponential distribution

Let \(X_i \sim \text{Exp}(\lambda)\): density \(f(x;\lambda) = \lambda e^{-\lambda x}\) for \(x \geq 0\). The log-likelihood is:

\[\ell(\lambda) = \sum_{i=1}^n \left[\log\lambda - \lambda x_i\right] = n\log\lambda - \lambda \sum_{i=1}^n x_i\]

Differentiate and set to zero:

\[\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0 \implies \lambda = \frac{n}{\sum x_i}\]

\[\boxed{\hat{\lambda} = \frac{1}{\bar{X}}}\]

The MLE of the rate is the reciprocal of the sample mean. If the mean lifetime in the sample is 5 hours, you estimate the failure rate as 0.2 failures per hour.
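A minimal check in Python, with made-up lifetimes chosen so the sample mean is 5 hours (mirroring the example above):

```python
# Made-up component lifetimes (hours), chosen so the sample mean is 5.0
lifetimes = [4.2, 6.1, 3.8, 5.5, 5.4]
mean_life = sum(lifetimes) / len(lifetimes)   # ≈ 5.0 hours
lam_hat = 1 / mean_life                       # MLE of the rate: ≈ 0.2 per hour
print(round(lam_hat, 3))  # → 0.2
```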

58.1.4 Unbiasedness and consistency

A point estimate is unbiased if \(E[\hat{\theta}] = \theta\) — the estimate is correct on average. The sample mean \(\bar{X}\) is unbiased for \(\mu\); the sample variance \(s^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\) is unbiased for \(\sigma^2\) (the \(n-1\) denominator matters here — the version with \(n\) in the denominator is biased). The reason for the \(n-1\): the deviations are measured from \(\bar{X}\), which is computed from the same data. Once \(\bar{X}\) is fixed, the deviations \((X_i - \bar{X})\) sum to zero, so only \(n-1\) of them are free to vary, and the sum of squared deviations has \(n-1\) degrees of freedom, not \(n\).

An estimate is consistent if \(\hat{\theta} \to \theta\) in probability as \(n \to \infty\) — more data drives the estimate to the truth. The MLE is consistent under mild regularity conditions.

Why MLE works

The likelihood is the probability of your data given \(\theta\). Maximising it over \(\theta\) answers: “which parameter value would make this data least surprising?” That question has a principled answer even when no other method is obvious — just write down the distribution, form the product, and differentiate.

58.2 Confidence intervals

A point estimate is a single number. It conceals the uncertainty. A confidence interval (CI) pairs the estimate with a margin that quantifies how much the estimate might be off.

58.2.1 The problem with small samples

If \(\sigma\) (the population standard deviation) were known, the sampling distribution of \(\bar{X}\) would be exactly \(N(\mu, \sigma^2/n)\) for a normal population, and you could build an interval from standard normal (\(z\)) critical values. But \(\sigma\) is almost never known — you estimate it from the sample as \(s\). When \(n\) is small, this extra source of uncertainty matters.

Student’s \(t\)-distribution handles this. The statistic

\[T = \frac{\bar{X} - \mu}{s/\sqrt{n}}\]

follows a \(t\)-distribution with \(\nu = n-1\) degrees of freedom.

Read “\(s/\sqrt{n}\)” as the standard error of the mean — the typical spread of sample means around the true mean. The \(t\)-distribution looks like the standard normal but with heavier tails; as \(\nu \to \infty\) it converges to the normal.

58.2.2 Constructing a 95% CI for \(\mu\)

Let \(t_{\alpha/2,\,n-1}\) be the upper \(\alpha/2\) critical value of the \(t\)-distribution with \(n-1\) degrees of freedom — the value such that \(P(T > t_{\alpha/2,\,n-1}) = \alpha/2\). For a 95% CI, \(\alpha = 0.05\).

Inverting \(|T| \leq t_{\alpha/2,\,n-1}\) gives the interval. Substituting \(T = (\bar{X} - \mu)/(s/\sqrt{n})\) and expanding the absolute value:

\[-t_{\alpha/2,\,n-1} \leq \frac{\bar{X} - \mu}{s/\sqrt{n}} \leq t_{\alpha/2,\,n-1}\]

Multiply through by \(s/\sqrt{n}\) and rearrange to isolate \(\mu\):

\[\bar{X} - t_{\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}} \;\leq\; \mu \;\leq\; \bar{X} + t_{\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}}\]

Written compactly:

\[\bar{X} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}\]

This is the 95% confidence interval for \(\mu\).

Worked example. A materials lab measures Young’s modulus for 10 specimens of a polymer. The sample mean is \(\bar{x} = 3.42\) GPa and the sample standard deviation is \(s = 0.18\) GPa. The \(t\)-table gives \(t_{0.025,\,9} = 2.262\).

\[\text{CI} = 3.42 \pm 2.262 \times \frac{0.18}{\sqrt{10}} = 3.42 \pm 2.262 \times 0.0569 = 3.42 \pm 0.129\]

The 95% CI is \((3.29,\; 3.55)\) GPa.
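The arithmetic can be checked with a few lines of Python (the critical value 2.262 is taken from the \(t\)-table, as above):

```python
import math

# Young's modulus example: n = 10, sample mean 3.42 GPa, sample s.d. 0.18 GPa
n, xbar, s = 10, 3.42, 0.18
t_crit = 2.262                        # t_{0.025, 9} from the t-table
half_width = t_crit * s / math.sqrt(n)
lo, hi = xbar - half_width, xbar + half_width
print(round(lo, 2), round(hi, 2))     # → 3.29 3.55
```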

58.2.3 What a confidence interval means — and what it does not

The 95% CI does not mean “there is a 95% probability that \(\mu\) lies in this interval.” After the data are observed, \(\mu\) is either in the interval or it isn’t — there is no probability involved.

The correct interpretation: if you repeated this experiment many times and built a 95% CI each time, about 95% of those intervals would contain the true \(\mu\). The procedure has 95% coverage. Any particular interval is just one instance of that procedure.
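A simulation makes the coverage interpretation concrete. The sketch below (Python, with an arbitrary normal population and \(n = 10\), so the critical value is \(t_{0.025,\,9} = 2.262\)) builds many intervals and counts how many contain the true mean:

```python
import math
import random

# Coverage simulation: draw many samples, build a 95% t-interval each time,
# and count how often the interval contains the true mean. The population
# parameters are arbitrary illustrations.
random.seed(0)
mu, sigma, n, t_crit = 100.0, 15.0, 10, 2.262   # t_{0.025, 9}
trials, covered = 2000, 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in sample) / (n - 1))
    half = t_crit * s / math.sqrt(n)
    covered += (xbar - half <= mu <= xbar + half)
coverage = covered / trials
print(coverage)  # close to 0.95
```

Each interval either contains \(\mu\) or it doesn't; what is 95% is the long-run hit rate of the procedure.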

58.3 Hypothesis testing

Point estimates and CIs describe what the data show. Hypothesis testing asks a sharper question: is there enough evidence to reject a specific claim?

58.3.1 The framework

The null hypothesis \(H_0\) is the claim under scrutiny — the default, conservative position. The alternative hypothesis \(H_1\) is what you’d conclude if \(H_0\) is implausible given the data.

A test statistic is a function of the data whose distribution under \(H_0\) is known. You compute the observed value, then ask: if \(H_0\) were true, how often would you see a value this extreme or more extreme? That probability is the p-value.

If the p-value is below the significance level \(\alpha\) (typically 0.05), you reject \(H_0\).

Two error types:

                           \(H_0\) true                                            \(H_0\) false
Reject \(H_0\)             Type I error (false positive), probability \(\alpha\)   Correct
Fail to reject \(H_0\)     Correct                                                 Type II error (false negative), probability \(\beta\)

Choosing \(\alpha = 0.05\) means you’re willing to falsely reject a true \(H_0\) in 5% of experiments. Reducing \(\alpha\) reduces Type I errors but increases Type II errors. The power of a test is \(1 - \beta\) — the probability of correctly detecting a real effect.

58.3.2 One-sample \(t\)-test

Hypotheses. You want to test whether a population mean \(\mu\) equals a specified value \(\mu_0\): \[H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0 \quad \text{(two-sided)}\]

Test statistic.

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]

Under \(H_0\) this follows \(t_{n-1}\).

Decision rule. Reject \(H_0\) if \(|t| > t_{\alpha/2,\,n-1}\).

Worked example. A valve is designed to open at 200 kPa. Quality control samples \(n = 20\) valves. The sample mean opening pressure is \(\bar{x} = 203.4\) kPa with \(s = 5.2\) kPa. Test \(H_0: \mu = 200\) at \(\alpha = 0.05\).

\[t = \frac{203.4 - 200}{5.2/\sqrt{20}} = \frac{3.4}{1.1628} = 2.924\]

The critical value is \(t_{0.025,\,19} = 2.093\). Since \(2.924 > 2.093\), reject \(H_0\). There is statistically significant evidence that the mean opening pressure differs from 200 kPa.
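The same computation as a Python sketch:

```python
import math

# Valve example: n = 20, sample mean 203.4 kPa, s = 5.2 kPa, testing μ0 = 200 kPa
n, xbar, s, mu0 = 20, 203.4, 5.2, 200.0
t_stat = (xbar - mu0) / (s / math.sqrt(n))
t_crit = 2.093                       # t_{0.025, 19} from the t-table
reject = abs(t_stat) > t_crit
print(round(t_stat, 3), reject)      # → 2.924 True
```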

58.3.3 Two-sample \(t\)-test

To compare means \(\mu_1\) and \(\mu_2\) from two independent populations (assuming equal variances), form the pooled variance:

\[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\]

The test statistic is:

\[t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}\]

Under \(H_0: \mu_1 = \mu_2\), this follows \(t_{n_1+n_2-2}\). Reject when \(|t| > t_{\alpha/2,\,n_1+n_2-2}\).

Welch’s \(t\)-test removes the equal-variance assumption by adjusting the degrees of freedom; it’s the default in most software.
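A small helper makes the pooled computation concrete. The summary statistics below are made up for illustration:

```python
import math

def pooled_t(n1, m1, s1, n2, m2, s2):
    """Two-sample t statistic under equal variances, from summary statistics."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Made-up summary statistics for two production batches
t_stat = pooled_t(8, 52.1, 3.0, 9, 49.4, 3.4)
df = 8 + 9 - 2
print(round(t_stat, 3), df)  # compare |t| with t_{0.025, 15} = 2.131
```

Here \(|t| \approx 1.73 < 2.131\), so this hypothetical comparison would fail to reject \(H_0\).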

58.3.4 \(\chi^2\) goodness-of-fit

When the data are counts in categories, the \(\chi^2\) statistic tests whether the observed frequencies are consistent with a specified distribution. For \(k\) categories with observed counts \(O_i\) and expected counts \(E_i = n p_i\):

\[\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}\]

Under \(H_0\), this follows approximately \(\chi^2_{k-1}\). Reject if \(\chi^2 > \chi^2_{\alpha,\,k-1}\). The rule of thumb \(E_i \geq 5\) in every cell ensures the approximation is reliable.
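As a sketch, testing a hypothetical die for fairness over 120 rolls (so \(E_i = 20\) per face; the counts are invented):

```python
# Hypothetical fair-die test: 120 rolls, so E_i = 120/6 = 20 per face
observed = [18, 24, 16, 22, 25, 15]
expected = [120 / 6] * 6
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # → 4.5, below chi²_{0.05, 5} = 11.07: fail to reject
```

Every expected count is 20 \(\geq 5\), so the \(\chi^2_5\) approximation is reliable here.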

58.4 Linear regression

Data rarely have a single variable of interest in isolation. Regression asks how one variable changes with another — and how reliably.

58.4.1 The model

The simple linear regression model is:

\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n\]

\(Y_i\) is the response (outcome), \(x_i\) is the predictor (input), \(\beta_0\) is the intercept (read “beta-zero” — the expected response when \(x = 0\)), \(\beta_1\) is the slope (read “beta-one” — the change in expected response per unit increase in \(x\)), and \(\varepsilon_i\) is the error term, assumed i.i.d. \(N(0, \sigma^2)\).

58.4.2 OLS estimators

Ordinary least squares minimises the sum of squared residuals:

\[\text{SS}_{res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]

Read “\(\text{SS}_{res}\)” as “residual sum of squares” — the sum of squared vertical distances between the data points and the fitted line.

Differentiate \(\text{SS}_{res}\) with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\) and set both partial derivatives to zero. This gives the normal equations:

\[\frac{\partial\,\text{SS}_{res}}{\partial\hat{\beta}_0} = -2\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\]

\[\frac{\partial\,\text{SS}_{res}}{\partial\hat{\beta}_1} = -2\sum x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\]

From the first equation: distribute the sum and use \(\sum_{i=1}^n 1 = n\):

\[\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \implies n\bar{y} - n\hat{\beta}_0 - \hat{\beta}_1 \cdot n\bar{x} = 0\]

Divide by \(n\): \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\).

Substituting into the second and solving:

\[\boxed{\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}}\]

\[\boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}}\]

where \(S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})\) and \(S_{xx} = \sum(x_i - \bar{x})^2\).
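The closed-form estimators translate directly into code. A Python sketch with made-up data (five points near a straight line):

```python
# OLS slope and intercept from the closed-form S_xy / S_xx solution
x = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1]         # made-up response
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = Sxy / Sxx                          # slope estimate
b0 = ybar - b1 * xbar                   # intercept estimate
print(round(b1, 3), round(b0, 3))       # → 1.99 0.05
```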

58.4.3 \(R^2\) — coefficient of determination

Define the total sum of squares \(\text{SS}_{tot} = \sum(y_i - \bar{y})^2\) and the residual sum of squares \(\text{SS}_{res} = \sum(y_i - \hat{y}_i)^2\). The coefficient of determination is:

\[R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}}\]

\(R^2 = 1\) means the line passes through every data point. \(R^2 = 0\) means the line explains none of the variation — it is no better than simply predicting \(\bar{y}\) for every observation. In simple linear regression, \(R^2 = r^2\) where \(r\) is the Pearson correlation coefficient.

58.4.4 \(t\)-test for \(\beta_1\)

Under the model assumptions, the standardised slope estimate

\[t = \frac{\hat{\beta}_1 - 0}{s_e / \sqrt{S_{xx}}}\]

follows \(t_{n-2}\) under \(H_0: \beta_1 = 0\), where \(s_e^2 = \text{SS}_{res}/(n-2)\) is the estimated error variance. The denominator is \(n-2\) rather than \(n-1\) because two parameters (\(\beta_0\) and \(\beta_1\)) are estimated from the data, consuming two degrees of freedom. Reject \(H_0\) if \(|t| > t_{\alpha/2,\,n-2}\). Failing to reject means there is no statistically significant linear relationship between \(x\) and \(Y\).
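A Python sketch with made-up data (five points near a straight line) computing \(R^2\) and the slope \(t\)-statistic together:

```python
import math

# Made-up data: five points lying close to a straight line
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
s_e = math.sqrt(ss_res / (n - 2))        # estimated error s.d., n-2 d.o.f.
t_stat = b1 / (s_e / math.sqrt(Sxx))
print(round(r2, 4), round(t_stat, 2))    # compare |t| with t_{0.025, 3} = 3.182
```

For this nearly linear data, \(R^2 \approx 0.997\) and \(|t| \approx 33 \gg 3.182\): the slope is overwhelmingly significant.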

58.4.5 Residual analysis

The regression assumptions — linearity, constant variance, normality of errors, independence — are checked through the residuals \(\hat{\varepsilon}_i = y_i - \hat{y}_i\).

Plot residuals against fitted values \(\hat{y}_i\):

  • Random scatter around zero — assumptions satisfied.
  • Curved pattern — linearity is violated; consider a transformation or a polynomial term.
  • Fan shape (variance increasing with \(\hat{y}_i\)) — heteroscedasticity; consider a log or square-root transformation of \(Y\).
  • Outliers — single points far from zero; investigate individually.

A normal Q-Q plot of residuals checks the normality assumption. Points falling close to the diagonal line support the assumption.

58.5 One-way ANOVA

When you want to compare means across three or more groups, running multiple \(t\)-tests inflates the Type I error rate. One-way ANOVA (analysis of variance) tests all groups simultaneously.

58.5.1 The F-test

Suppose you have \(k\) groups, with \(n_j\) observations in group \(j\). Let \(\bar{y}_{j\cdot}\) denote the mean of group \(j\) and \(\bar{y}\) the overall mean. Define:

\[\text{SS}_{between} = \sum_{j=1}^k n_j (\bar{y}_{j\cdot} - \bar{y})^2 \qquad \text{(variation explained by group membership)}\]

\[\text{SS}_{within} = \sum_{j=1}^k \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{j\cdot})^2 \qquad \text{(variation within groups)}\]

The mean squares are \(\text{MS}_{between} = \text{SS}_{between}/(k-1)\) and \(\text{MS}_{within} = \text{SS}_{within}/(N-k)\), where \(N = \sum n_j\).

The F-statistic is:

\[F = \frac{\text{MS}_{between}}{\text{MS}_{within}}\]

Under \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\), \(F\) follows \(F_{k-1,\,N-k}\). Reject \(H_0\) if \(F > F_{\alpha,\,k-1,\,N-k}\). A large \(F\) means between-group variation is large relative to within-group variation — the groups genuinely differ.

Why F instead of multiple \(t\)-tests?

With \(k = 3\) groups, running all pairwise \(t\)-tests at \(\alpha = 0.05\) gives a familywise error rate of approximately \(1 - (0.95)^3 \approx 0.14\). You’d falsely reject a true null about 14% of the time rather than 5%. ANOVA controls this by testing all groups in a single F-test. If ANOVA rejects \(H_0\), post-hoc pairwise tests (Tukey HSD, Bonferroni) then identify which groups differ.

Engineering application. Three machine settings — A, B, C — produce polymer film. Quality engineers measure tensile strength (MPa) from five runs at each setting:

Setting A Setting B Setting C
41.2 38.5 44.1
40.8 39.2 43.8
42.1 37.9 45.0
41.5 38.8 44.3
40.4 39.6 43.4

Group means: \(\bar{y}_{A} = 41.2\), \(\bar{y}_{B} = 38.8\), \(\bar{y}_{C} = 44.12\). Overall mean \(\bar{y} = 41.37\).

\(\text{SS}_{between} = 5(41.2-41.37)^2 + 5(38.8-41.37)^2 + 5(44.12-41.37)^2\) \(= 5(0.0289) + 5(6.6049) + 5(7.5625) = 0.1445 + 33.02 + 37.81 = 70.98\)

\(\text{MS}_{between} = 70.98/2 = 35.49\)

Computing \(\text{SS}_{within}\) from the deviations within each group: setting A contributes \(1.70\), setting B \(1.70\), and setting C \(1.43\), giving \(\text{SS}_{within} = 4.83\), so \(\text{MS}_{within} = 4.83/12 = 0.402\).

\[F = \frac{35.49}{0.402} \approx 88.2\]

The critical value \(F_{0.05,\,2,\,12} = 3.89\). Since \(88.2 \gg 3.89\), reject \(H_0\). The machine settings produce significantly different tensile strengths — the effect is not noise.
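Hand arithmetic on ANOVA tables goes wrong easily, so recomputing the sums of squares directly from the raw table is a good habit. A Python sketch:

```python
# Recompute the polymer-film ANOVA from the raw data table
groups = [
    [41.2, 40.8, 42.1, 41.5, 40.4],   # setting A
    [38.5, 39.2, 37.9, 38.8, 39.6],   # setting B
    [44.1, 43.8, 45.0, 44.3, 43.4],   # setting C
]
k = len(groups)
N = sum(len(g) for g in groups)
grand = sum(sum(g) for g in groups) / N
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
ss_within = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (N - k))
# F far exceeds F_{0.05, 2, 12} = 3.89, so H0 is rejected
print(round(ss_between, 2), round(ss_within, 2), round(F, 1))
```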

58.6 Where this goes

The methods here assume a parametric form for the distributions — Bernoulli, exponential, normal. Non-parametric tests (Wilcoxon, Mann-Whitney, Kruskal-Wallis) make weaker assumptions at the cost of some power; they become relevant when distributions are heavily skewed or sample sizes are very small. Bayesian inference reframes the problem entirely: instead of asking “what does the data tell us about \(\theta\)?” it asks “given a prior belief about \(\theta\) and the data, what should our updated belief be?” Both extensions build on exactly the likelihood functions you derived here.

Multiple linear regression extends the single-predictor model to \(Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon\). The OLS estimator in matrix form is \(\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}\), which requires the linear algebra from Volume 7’s linear algebra section. Generalised linear models (GLMs) then extend regression to non-normal responses — logistic regression for binary outcomes, Poisson regression for counts — connecting directly to the MLE framework you already have.

Where this shows up

  • Reliability engineering: MLE estimates failure rates from field data; confidence intervals bound the estimate; hypothesis tests determine whether a design change made a difference.
  • Clinical trials: two-sample \(t\)-tests compare treatment and control groups; sample size is set to achieve target power.
  • Process control: \(t\)-tests on subgroup means detect shifts in a manufacturing process mean; ANOVA compares multiple line configurations simultaneously.
  • Geophysics: linear regression of seismic travel time against distance estimates subsurface velocity; the \(t\)-test for \(\beta_1\) determines whether the relationship is real.
  • Machine learning: the OLS solution is closed-form linear regression; \(R^2\) and residual analysis evaluate model fit before any regularisation is added.

58.7 Exercises

These are puzzles. Each has a clean numerical answer. The work is in setting up the likelihood, test statistic, or normal equations correctly before computing.


Exercise 1. An engineer records the time to failure (hours) for five identical electronic components under accelerated testing:

\[x_1 = 820, \quad x_2 = 1\,040, \quad x_3 = 650, \quad x_4 = 930, \quad x_5 = 760\]

Assuming lifetimes are Exponential(\(\lambda\)), find \(\hat{\lambda}\) (the MLE) and its standard error. Use the fact that for the Exponential MLE, \(\text{SE}(\hat{\lambda}) = \hat{\lambda}/\sqrt{n}\).


Exercise 2. A tensile test on 16 specimens of an aluminium alloy yields a sample mean yield strength of \(\bar{x} = 48.2\) MPa and sample standard deviation \(s = 6.4\) MPa. Construct a 95% confidence interval for the population mean yield strength. Use \(t_{0.025,\,15} = 2.131\).


Exercise 3. A CNC milling process is specified to produce shafts with a mean diameter of 50.00 mm. A quality technician measures a random sample of 25 shafts and finds \(\bar{x} = 47.8\) mm and \(s = 5.1\) mm. Test the null hypothesis \(H_0: \mu = 50\) against \(H_1: \mu \neq 50\) at significance level \(\alpha = 0.05\). Use \(t_{0.025,\,24} = 2.064\). State your conclusion.


Exercise 4. Two aluminium alloys are compared for tensile strength. Alloy 1 (\(n_1 = 10\), \(\bar{x}_1 = 312\) MPa, \(s_1 = 14\) MPa) and Alloy 2 (\(n_2 = 12\), \(\bar{x}_2 = 298\) MPa, \(s_2 = 18\) MPa). Assuming equal population variances, test \(H_0: \mu_1 = \mu_2\) against \(H_1: \mu_1 \neq \mu_2\) at \(\alpha = 0.05\). Use \(t_{0.025,\,20} = 2.086\).


Exercise 5. A process engineer suspects that curing temperature \(x\) (°C) affects bond strength \(Y\) (N/mm²). Five runs are recorded:

\(x\) (°C)        120    130    140    150    160
\(Y\) (N/mm²)     8.1    9.4    10.8   11.5   12.9
  1. Compute the OLS estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
  2. Compute \(R^2\).
  3. Test \(H_0: \beta_1 = 0\) at \(\alpha = 0.05\), given \(t_{0.025,\,3} = 3.182\).

Exercise 6. Three CNC machine settings produce the following daily output yields (units per shift):

Setting 1 Setting 2 Setting 3
102 98 115
105 101 112
99 97 118
103 100 114

Perform a one-way ANOVA F-test for equality of means at \(\alpha = 0.05\). Use \(F_{0.05,\,2,\,9} = 4.26\). State whether setting affects yield.