35  Data Analysis

Patterns in numbers, conclusions about the world

Spotify has about 100 million songs. When you finish a track and it auto-plays the next one, you didn’t choose that song — an algorithm did. The algorithm looked at what you’ve played, when you skipped, what similar listeners chose, and built a model of your taste from the pattern in those numbers.

That is data analysis: you have a pile of numbers that records something about the world, and you want to find the pattern, measure it, and use it to predict something new.

The methods in this chapter are the foundation of that entire enterprise. The streaming recommendation is built on techniques that grew from the same roots you’re about to learn: summarising a distribution, measuring how two variables move together, fitting a line that lets you predict one thing from another. Every time a platform says we think you’ll like this, the chain of reasoning traces back here.

The chapter has four moves. First: organise and describe data on a single variable — centre, spread, shape. Second: describe the relationship between two variables visually, before any formula. Third: measure that relationship with a single number, the correlation coefficient \(r\). Fourth: model it with a line — regression — so you can make predictions. At each step the goal is not just to produce a number, but to know what the number is actually saying.


35.1 1 · Types of data

Before you summarise data, you need to know what kind of data you have. The choice of summary and visualisation depends on it.

Categorical data names groups: favourite genre, country of birth, blood type, whether a patient recovered. The values are labels, not amounts — you cannot meaningfully average them.

Quantitative data measures amounts: temperature, height, number of streams, exam score. These values are numbers where arithmetic makes sense.

Within quantitative data there is a further distinction.

Discrete data takes separate, countable values: number of goals scored, number of siblings, number of days absent. There is no meaningful value between 2 and 3.

Continuous data can take any value in a range: height, weight, time, temperature. Between any two values there is always another one.

Statisticians sometimes also describe four levels of measurement, which capture a slightly different property — how much structure the numbers carry.

| Level | Example | What you can do |
| --- | --- | --- |
| Nominal | Blood type: A, B, AB, O | Count only; order is meaningless |
| Ordinal | Survey: agree / neutral / disagree | Order is meaningful; gaps between values are not |
| Interval | Temperature in °C | Differences are meaningful; zero is arbitrary (0°C is not “no temperature”) |
| Ratio | Distance, mass, income | Differences and ratios both meaningful; zero means none |

This matters in practice. You can say that 40°C is 20° hotter than 20°C (interval: differences work). You cannot say it is twice as hot, because 0°C is not “no heat” — it is just the freezing point of water. But you can say that 40km is twice as far as 20km (ratio: zero distance genuinely means none).

For this chapter the key split is categorical vs. quantitative, because it determines every choice that follows: which summary statistics to compute, which plots to draw, and whether correlation and regression make sense at all.


35.2 2 · Describing one variable

35.2.1 Centre

Three measures compete for “the typical value” of a dataset:

Mean — add all values, divide by how many.

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Median — the middle value when the data is sorted. For an even number of values, average the two middle ones.

Mode — the most frequently occurring value.

Each one answers a slightly different question, and each has weaknesses.

Consider seven students’ weekly hours of paid work:

\[4, \; 6, \; 6, \; 8, \; 9, \; 11, \; 42\]

The mean is \((4 + 6 + 6 + 8 + 9 + 11 + 42) / 7 = 86/7 \approx 12.3\). The median is the fourth value when sorted: \(8\). The mode is \(6\).

Which is “right”? That depends on your question. The 42 is clearly unusual — one student has a second job. The mean got pulled to 12.3 by that one outlier, past six of the seven actual values. The median (\(8\)) is not sensitive to the extreme value. The mode tells you the most common single value.

The outlier rule of thumb

When a distribution has outliers or is heavily skewed, report the median as the centre. The mean is a better summary when the data is roughly symmetric and free of extreme values. Both together tell you more than either alone.
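All three measures are one call each in Python’s standard library. A minimal sketch on the work-hours data, using the stdlib `statistics` module:

```python
import statistics

hours = [4, 6, 6, 8, 9, 11, 42]

mean = statistics.mean(hours)      # 86 / 7, pulled up by the outlier
median = statistics.median(hours)  # middle of the sorted list: 8
mode = statistics.mode(hours)      # most frequent value: 6

print(round(mean, 1), median, mode)  # → 12.3 8 6
```

Comparing the three outputs side by side is the quickest way to spot a skewed or outlier-ridden sample: a mean well above the median is the classic signature.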

35.2.2 Spread

A measure of centre on its own is incomplete. Knowing the average temperature in a city tells you nothing about whether it’s stable or wildly variable. You need a measure of spread.

Range = maximum − minimum. Simple, but sensitive to extremes. In the work-hours example: \(42 - 4 = 38\).

Interquartile range (IQR) = \(Q_3 - Q_1\). The middle 50% of the data, untouched by the extremes.

To find \(Q_1\) and \(Q_3\): split the sorted data at the median, then find the median of each half.

For our seven values \(4, 6, 6, 8, 9, 11, 42\):

  • Median (Q2) = 8
  • Lower half: \(4, 6, 6\), so \(Q_1 = 6\)
  • Upper half: \(9, 11, 42\), so \(Q_3 = 11\)
  • \(\text{IQR} = 11 - 6 = 5\)

The middle 50% of students work between 6 and 11 hours per week. The 42 did not affect this at all.

Standard deviation measures the average distance from the mean. You will see this written as \(s\) for a sample:

\[s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\]

The formula squares the deviations (so positive and negative distances don’t cancel), takes their average (roughly — the \(n-1\) is a technical correction for samples), then takes the square root to get back to the original units.

For the work-hours example, with mean \(\approx 12.3\):

| \(x_i\) | \(x_i - \bar{x}\) | \((x_i - \bar{x})^2\) |
| --- | --- | --- |
| 4 | −8.3 | 68.89 |
| 6 | −6.3 | 39.69 |
| 6 | −6.3 | 39.69 |
| 8 | −4.3 | 18.49 |
| 9 | −3.3 | 10.89 |
| 11 | −1.3 | 1.69 |
| 42 | +29.7 | 882.09 |

Sum of squared deviations: \(1061.43\). Divide by \(n - 1 = 6\): \(176.9\). Take the square root: \(s \approx 13.3\) hours.

That large standard deviation reflects the 42-hour outlier distorting the picture. Again: when a distribution has strong outliers, IQR is the more honest spread measure.
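The sum-of-squares arithmetic in the table can be checked in a few lines; `statistics.stdev` applies exactly the \(n-1\) formula above. A sketch, not a required part of the toolkit:

```python
import statistics

hours = [4, 6, 6, 8, 9, 11, 42]
xbar = statistics.mean(hours)

# Sum of squared deviations, then the n-1 divisor, then the square root
ss = sum((x - xbar) ** 2 for x in hours)
s_by_hand = (ss / (len(hours) - 1)) ** 0.5

print(round(s_by_hand, 1))                # ≈ 13.3 hours
print(round(statistics.stdev(hours), 1))  # same answer from the library
```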

35.2.3 Shape: the five-number summary and box plot

The five-number summary is min, \(Q_1\), median, \(Q_3\), max. It describes the distribution’s shape without computing a mean at all:

\[4, \quad 6, \quad 8, \quad 11, \quad 42\]

A box plot draws this. The box spans from \(Q_1\) to \(Q_3\). A line inside the box marks the median. “Whiskers” extend to the minimum and maximum (or to the boundary of non-outlier data, with individual dots for outliers beyond \(1.5 \times \text{IQR}\) from the box edges).

Box plots are excellent for comparing distributions across groups — for example, work hours broken down by year of study — because they pack a lot of shape information into a small space.

Outlier detection rule

A common convention: flag a value as an outlier if it falls more than \(1.5 \times \text{IQR}\) below \(Q_1\) or above \(Q_3\).

Lower fence: \(Q_1 - 1.5 \times \text{IQR} = 6 - 7.5 = -1.5\)

Upper fence: \(Q_3 + 1.5 \times \text{IQR} = 11 + 7.5 = 18.5\)

The value 42 is above 18.5, so it is flagged as an outlier. The other values all fall within the fences.
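The median-of-halves quartile rule and the fence check translate directly into code. One caution: library routines such as `statistics.quantiles` or `numpy.percentile` use different interpolation conventions and can give slightly different quartiles on small samples, so this sketch follows the chapter’s split-at-the-median rule instead:

```python
import statistics

def tukey_outliers(data):
    """Flag outliers using median-of-halves quartiles and 1.5 * IQR fences."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]        # values below the median
    upper = xs[(n + 1) // 2 :]  # values above the median
    q1, q3 = statistics.median(lower), statistics.median(upper)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo_fence or x > hi_fence], (q1, q3, iqr)

outliers, (q1, q3, iqr) = tukey_outliers([4, 6, 6, 8, 9, 11, 42])
print(q1, q3, iqr)  # → 6 11 5
print(outliers)     # → [42]
```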


35.3 3 · Describing relationships

When you have two quantitative variables measured on the same subjects — hours studied and exam score, temperature and ice cream sales, daily steps and resting heart rate — the question changes from “what is the distribution of one variable?” to “how do these two variables move together?”

The right starting point is always a scatter plot. Put one variable on the horizontal axis (\(x\)) and the other on the vertical axis (\(y\)). Each observation becomes a dot.

Before reaching for any formula, look at the plot and describe what you see using three properties.

Direction. As \(x\) increases, does \(y\) tend to increase (positive association) or decrease (negative association), or is there no consistent tendency?

Form. Does the pattern look like a straight line (linear), a curve, or something else entirely?

Strength. How tightly do the points cluster around the pattern? A tight cluster suggests a strong relationship. A diffuse cloud suggests a weak one.

Only if the form is linear does it make sense to ask “how strong is this linear relationship?” — which is what the correlation coefficient measures. If the form is curved, the correlation coefficient can give a misleading answer (we will see this shortly).


35.4 4 · Pearson correlation

The Pearson correlation coefficient \(r\) measures the strength and direction of a linear relationship between two quantitative variables. It is always a number between \(-1\) and \(+1\).

\[r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]

This is the average of the products of the standardised \(x\) and \(y\) values. Standardising means subtracting the mean and dividing by the standard deviation, so each variable’s scale disappears — \(r\) is unit-free.

The intuition: when a standardised \(x\) is positive (above average) and the corresponding standardised \(y\) is also positive, their product is positive. When both are below average, both are negative, and the product is still positive. When one is above and one is below, the product is negative. If there is a consistent positive trend, positive products dominate and \(r\) is close to \(+1\). If there is a consistent negative trend, negative products dominate and \(r\) is close to \(-1\). If there is no consistent linear pattern, positive and negative products cancel and \(r\) is near \(0\).

A worked example. Five students, with hours of sleep the night before an exam (\(x\)) and exam score out of 40 (\(y\)):

| Student | Sleep \(x\) | Score \(y\) |
| --- | --- | --- |
| A | 5 | 22 |
| B | 6 | 25 |
| C | 7 | 30 |
| D | 8 | 33 |
| E | 9 | 35 |

\(\bar{x} = 7\), \(\bar{y} = 29\)

\(s_x = \sqrt{\frac{(5-7)^2+(6-7)^2+(7-7)^2+(8-7)^2+(9-7)^2}{4}} = \sqrt{\frac{4+1+0+1+4}{4}} = \sqrt{2.5} \approx 1.581\)

\(s_y = \sqrt{\frac{(22-29)^2+(25-29)^2+(30-29)^2+(33-29)^2+(35-29)^2}{4}} = \sqrt{\frac{49+16+1+16+36}{4}} = \sqrt{29.5} \approx 5.431\)

| Student | \(\frac{x_i-\bar{x}}{s_x}\) | \(\frac{y_i-\bar{y}}{s_y}\) | Product |
| --- | --- | --- | --- |
| A | −1.265 | −1.288 | 1.629 |
| B | −0.632 | −0.737 | 0.466 |
| C | 0 | 0.184 | 0 |
| D | 0.632 | 0.737 | 0.466 |
| E | 1.265 | 1.104 | 1.397 |

Sum of products: \(3.958\). Divide by \(n - 1 = 4\): \(r \approx 0.99\).

That is a very strong positive linear relationship. More sleep, higher score — and it is nearly perfectly linear across this small sample.
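The standardised-products table translates line for line into code. A minimal sketch, using only the stdlib, that reproduces \(r \approx 0.99\) on the sleep/score data:

```python
import statistics

def pearson_r(xs, ys):
    """Sum of products of standardised values, with the n-1 divisor."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - xbar) / sx * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

sleep = [5, 6, 7, 8, 9]
score = [22, 25, 30, 33, 35]
print(round(pearson_r(sleep, score), 2))  # → 0.99
```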

Interpreting \(r\): there is no single rule, but typical conventions are:

| \(\lvert r \rvert\) | Interpretation |
| --- | --- |
| 0.9 – 1.0 | Very strong |
| 0.7 – 0.9 | Strong |
| 0.5 – 0.7 | Moderate |
| 0.3 – 0.5 | Weak |
| 0.0 – 0.3 | Very weak or none |

\(r\) measures only linear association

If the relationship is curved, \(r\) can be close to zero even when there is a strong and obvious pattern. A dataset where \(y = x^2\) centred at zero will give \(r \approx 0\) even though \(y\) is completely determined by \(x\). This is why you always look at the scatter plot first.
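This warning is easy to demonstrate. For \(x\)-values centred at zero with \(y = x^2\), every above-average product is cancelled by a matching below-average one, so \(r\) comes out at zero even though \(y\) is completely determined by \(x\). A sketch (the `pearson_r` helper is the same hand-rolled function used earlier, not a library call):

```python
import statistics

def pearson_r(xs, ys):
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - xbar) / sx * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y is a perfect function of x, but not a linear one

print(abs(pearson_r(xs, ys)) < 1e-9)  # → True: r is (numerically) zero
```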


35.5 5 · Correlation does not imply causation

This section deserves real time. It is not a disclaimer. It is one of the most practically important ideas in this entire volume.

Finding that \(r = 0.78\) between two variables does not tell you that one causes the other. It tells you they are associated — they tend to move together. Association and causation are different things, and confusing them is one of the most common errors in public reasoning about data.

There are three mechanisms that produce correlation without causation.


Mechanism 1: Common cause (confounding)

Ice cream sales and drowning deaths are positively correlated. In months when ice cream sales are high, drowning deaths are also high. Does eating ice cream cause drowning?

No. Both are caused by a third variable: hot weather. When it is hot, more people buy ice cream and more people swim. The ice cream and the drownings are both responding to temperature. Temperature is the confounding variable — a common cause that creates an apparent relationship between two things that do not directly affect each other.

This is extremely common in health and social science research. Two variables that seem related often share an underlying driver.


Mechanism 2: Reverse causation

Suppose a survey finds that people who sleep fewer hours are more likely to be depressed. The tempting interpretation: poor sleep causes depression.

But the causal arrow might point the other way. Depression is a common cause of disturbed sleep. The correlation is real; the causal direction is not established by the correlation alone.

Or both things cause each other in a loop — poor sleep worsens mood, worsened mood disrupts sleep. Correlation cannot untangle this.


Mechanism 3: Coincidence

Nicolas Cage releases more films in years when more people drown in swimming pools. Over one eleven-year stretch the correlation was reported at \(r \approx 0.67\).

There is obviously no causal connection. With a sufficiently large collection of variables measured over time, some pairs will be correlated by chance. The internet is full of these “spurious correlations” — per capita cheese consumption and deaths by bedsheet tangling, US spending on science and suicides by hanging. These are real correlations in the data. They mean nothing.


What does establish causation?

Observational data (just measuring things in the world) can suggest causal hypotheses. Establishing causation requires:

  • Randomised controlled experiments — randomly assign subjects to conditions so confounders are distributed equally across groups.
  • Strong temporal evidence — the cause must precede the effect.
  • Plausible mechanism — a credible explanation of how one thing could cause the other.
  • Replication — the relationship holds across different populations and study designs.

When you read a headline claiming “X causes Y,” your first question should be: was this a randomised experiment, or an observational study? If the latter, the correct language is “X is associated with Y” — not “X causes Y.”

The student’s toolkit for evaluating causal claims

Ask three questions:

  1. Could a third variable (confounder) explain both X and Y?
  2. Could the direction be reversed — does Y cause X instead?
  3. Could this be coincidence in a large dataset?

If any of these is plausible, the causal claim is not established.


35.6 6 · Linear regression

Once you have established (from a scatter plot) that two variables have a linear form, you can model that relationship with a line. That line lets you do two things the correlation coefficient cannot: make predictions for new values, and quantify how much \(y\) changes per unit of \(x\).

The problem. Suppose you have a scatter of points that approximately follow a line. Which line? You could draw many lines that pass “through” the cloud. The least-squares line is the one that minimises the sum of squared residuals — where a residual \(e_i = y_i - \hat{y}_i\) is the vertical distance from each point to the line.

Why squared? Squaring makes every contribution positive (so positive and negative errors don’t cancel), and it penalises large errors more heavily than small ones. The least-squares line is the unique line that minimises the total squared error — it is “as close as possible” to all the points simultaneously.

The formulas. For the least-squares line \(\hat{y} = b_0 + b_1 x\):

\[b_1 = r \cdot \frac{s_y}{s_x}\]

\[b_0 = \bar{y} - b_1 \bar{x}\]

where \(r\) is the Pearson correlation, \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), and \(\bar{x}\), \(\bar{y}\) are the means.

Interpretation.

\(b_1\) (slope): for every one-unit increase in \(x\), the predicted value of \(y\) changes by \(b_1\) units. Always interpret this in context — the units matter.

\(b_0\) (intercept): the predicted value of \(y\) when \(x = 0\). This is sometimes meaningful (it might represent a baseline), but often extrapolating to \(x = 0\) is outside the range of the data and the intercept is just an algebraic anchor with no practical interpretation.

The regression line always passes through \((\bar{x}, \bar{y})\). This is a useful sanity check.
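Applied to the sleep/score data from Section 4, the two formulas give a concrete line. A sketch; it uses the algebraically equivalent deviation form \(b_1 = \sum(x_i-\bar{x})(y_i-\bar{y}) \,/\, \sum(x_i-\bar{x})^2\), which avoids compounding the rounding in \(r\), \(s_x\), and \(s_y\):

```python
import statistics

def least_squares(xs, ys):
    """Slope and intercept of the least-squares line y-hat = b0 + b1 * x."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    # Deviation form of b1 = r * sy / sx (algebraically identical)
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar  # forces the line through (xbar, ybar)
    return b0, b1

sleep = [5, 6, 7, 8, 9]
score = [22, 25, 30, 33, 35]
b0, b1 = least_squares(sleep, score)
print(round(b0, 2), round(b1, 2))  # → 5.2 3.4
```

So the model predicts score \(\hat{y} = 5.2 + 3.4x\): roughly 3.4 extra marks per additional hour of sleep, in this sample. Note that \(b_0 + b_1\bar{x} = 5.2 + 3.4 \times 7 = 29 = \bar{y}\), the sanity check above.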


35.7 7 · \(R^2\) and residual plots

35.7.1 The coefficient of determination

\(R^2\) (R-squared) measures how well the regression line fits the data. For simple linear regression, \(R^2 = r^2\) — the square of the Pearson correlation.

\[R^2 = r^2\]

\(R^2\) is the proportion of the variance in \(y\) that is explained by the linear relationship with \(x\). It ranges from 0 to 1.

\(R^2 = 0.72\) means: 72% of the variation in \(y\) is accounted for by the linear relationship with \(x\). The remaining 28% is due to other factors not in the model, natural variability, or measurement error.

\(R^2 = 0.25\) means the line captures only 25% of the variation. A lot is left unexplained — you might be missing important variables, or the relationship might not be well-described by a line.

\(R^2 = 0.95\) means 95% of the variation is explained — a strong fit.

\(R^2\) in plain language

Imagine the simplest possible model for \(y\): just predict \(\bar{y}\) for everyone. That model ignores \(x\) completely. The total spread around that flat prediction is the total variance in \(y\).

The regression line does better — it accounts for \(x\). \(R^2\) measures how much better: what fraction of the total variance did the line explain away?

35.7.2 Residual plots

Fitting a line and computing \(R^2\) is not the end of the analysis. You need to check whether the line is actually the right model.

A residual plot shows the residuals \(e_i = y_i - \hat{y}_i\) on the vertical axis against the fitted values \(\hat{y}_i\) (or against \(x\)) on the horizontal axis. A good model produces a scatter with no pattern — residuals roughly centred at zero, with no curve, fan shape, or clustering.

If the linear model is appropriate, the residuals should look like random scatter around zero — no pattern.

What patterns signal problems:

  • A curved pattern (residuals positive at the extremes, negative in the middle) means the true relationship is nonlinear. A line is the wrong model.
  • A funnel shape (residuals spreading out as \(x\) increases) means the variability in \(y\) changes with \(x\). The model’s uncertainty is not constant — a standard regression assumption is violated.
  • A few extreme residuals are outliers in the regression sense. They may be data errors or genuinely unusual cases.

The residual plot is the diagnostic. A clean residual plot validates your model; a patterned one tells you to reconsider.


35.8 8 · End-to-end worked example

A café tracks average daily temperature (°C) and the number of iced coffee drinks sold. Here are eight days of data:

| Day | Temp \(x\) | Drinks \(y\) |
| --- | --- | --- |
| 1 | 18 | 34 |
| 2 | 21 | 40 |
| 3 | 24 | 48 |
| 4 | 27 | 55 |
| 5 | 29 | 61 |
| 6 | 31 | 65 |
| 7 | 33 | 72 |
| 8 | 36 | 78 |

Step 1: Compute the summary statistics.

\(n = 8\)

\(\bar{x} = (18 + 21 + 24 + 27 + 29 + 31 + 33 + 36)/8 = 219/8 = 27.375\)

\(\bar{y} = (34 + 40 + 48 + 55 + 61 + 65 + 72 + 78)/8 = 453/8 = 56.625\)

\(s_x \approx 6.117\), \(s_y \approx 15.35\)

(These are computed using the standard deviation formula from Section 2.)

Step 2: Compute the Pearson correlation.

Running through all 8 standardised products (same procedure as the 5-row table in Section 4) gives:

\[r \approx 0.999\]

A near-perfect positive linear relationship. This is not surprising — the scatter plot would show the points nearly on a straight line.

Step 3: Find the least-squares line.

\[b_1 = r \cdot \frac{s_y}{s_x} = 0.999 \times \frac{15.35}{6.117} \approx 0.999 \times 2.509 \approx 2.506\]

\[b_0 = \bar{y} - b_1 \bar{x} = 56.625 - 2.506 \times 27.375 \approx 56.625 - 68.602 \approx -11.97\]

The regression equation is:

\[\hat{y} = -11.97 + 2.506x\]

Step 4: Interpret slope and intercept.

Slope: For every 1°C increase in temperature, the model predicts about 2.5 additional iced coffees sold. This makes intuitive sense.

Intercept: The model predicts \(-11.97\) drinks at 0°C. This is nonsensical — no café sells negative drinks — but we should not expect the intercept to be meaningful here. The data ranges from 18°C to 36°C. Extrapolating to 0°C is far outside that range.

Step 5: Predict a new value.

On a 30°C day, the model predicts:

\[\hat{y} = -11.97 + 2.506 \times 30 = -11.97 + 75.18 = 63.21\]

Approximately 63 iced coffees.

Step 6: Compute \(R^2\).

\[R^2 = r^2 = (0.999)^2 \approx 0.998\]

The linear model explains about 99.8% of the variation in drinks sold. Temperature is almost the entire story here (within this range and this simple model).

Step 7: Check a residual.

For Day 5, the actual value is \(y_5 = 61\). The fitted value is:

\[\hat{y}_5 = -11.97 + 2.506 \times 29 = -11.97 + 72.674 = 60.70\]

Residual: \(e_5 = 61 - 60.70 = 0.30\)

The model is off by less than one drink on this day. The residuals for all eight days would be small and show no obvious pattern — this is exactly what a well-fitting linear model looks like.


Computing form of the regression slope

An algebraically equivalent form useful for hand calculation:

\[b_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}\]

This is the same formula in computing form — useful when you have the raw sums but not the deviations. You will use this form in Exercise 4.

35.9 Exercises

Exercise 1. The following dataset shows the number of hours seven people spent on social media last week:

\[3, \; 7, \; 7, \; 9, \; 12, \; 14, \; 38\]

Compute the mean, median, IQR, and identify any outliers using the \(1.5 \times \text{IQR}\) rule.


Exercise 2. A scatter plot of students’ hours of revision (\(x\)) and exam percentage (\(y\)) shows:

  • Points rise from bottom-left to top-right
  • The points cluster fairly tightly around an imaginary straight line
  • There are no pronounced curves or bends in the pattern

Describe the direction, form, and strength of the association, and estimate whether \(r\) is closer to 0.1, 0.7, or 0.95. Justify your estimate.


Exercise 3. A study reports \(r = 0.82\) between the number of books students own and their reading comprehension test scores. For each of the following three statements, say whether it is correct or incorrect and explain why.

(a) “Students who own more books tend to score higher on reading tests.”

(b) “Owning more books causes higher reading comprehension.”

(c) “82% of students who own books score above average on reading tests.”


Exercise 4. A dataset of \(n = 6\) observations has the following summary totals:

\[n = 6, \quad \sum x = 42, \quad \sum y = 78, \quad \sum x^2 = 322, \quad \sum xy = 591\]

Use the computing formulas below to find \(b_1\) and \(b_0\), then write the regression equation.

\[b_1 = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \qquad b_0 = \bar{y} - b_1 \bar{x}\]


Exercise 5. A researcher models the relationship between daily temperature (\(x\), in °C) and the number of parkrun participants (\(y\)). The regression equation is:

\[\hat{y} = 120 + 3.8x\]

  1. Interpret the slope in context.

  2. Interpret the intercept in context. Is it meaningful here?

  3. Predict the number of participants on a 15°C day.

  4. On one 15°C day, the actual count was 182. Compute the residual and interpret it.


Exercise 6 — Critical thinking. A newspaper runs the headline:

Students who eat breakfast get better grades — schools should mandate breakfast to raise academic performance.

The headline is based on a large survey showing a positive correlation between breakfast-eating habits and GPA. Using the concepts from this chapter, evaluate the causal claim. What questions would you ask before accepting the conclusion? What alternative explanations exist?


35.10 Where this chapter leaves you

Volume 6 is complete.

You can now describe a distribution in full — its centre, spread, and shape — and choose the right summary for the kind of data you have and the presence or absence of outliers. You can describe the relationship between two variables from a scatter plot before any formula touches the data. You can measure that relationship with \(r\), interpret what the number actually means, and recognise the one thing it emphatically does not mean: causation. And you can build the simplest predictive model, the least-squares line, interpret its slope and intercept in context, assess its fit with \(R^2\), and diagnose problems with a residual plot.

That is a complete analytical toolkit for real data at this level.

The road ahead is wide. Volume 7 opens with the natural extension of everything here: multiple regression, where more than one predictor variable (\(x_1, x_2, \ldots\)) is used to model \(y\) simultaneously. The least-squares criterion is the same; solving it with several predictors requires linear algebra. Multiple regression is also the foundation of the simplest supervised machine learning model — gradient descent and neural networks are, at their roots, sophisticated generalisations of fitting a line to data. Everything you have learned in this chapter is load-bearing for what follows.

Where this shows up

  • A climate scientist analysing temperature trends over decades is fitting a regression line to time-series data and checking the residuals.
  • A sports analyst building a player rating model is computing correlations between performance metrics and outcomes, then building a regression to predict results.
  • A public health researcher reporting that “smoking is associated with lung cancer” is using the careful causal language earned by decades of controlled studies — because correlation alone was not enough to establish it.
  • A data scientist building a recommendation system starts with correlation matrices. The step from there to collaborative filtering and neural networks is real, but this is where the intuition lives.

The ideas are simple. Their reach is extraordinary.