Statistics — Regression and Correlation
Master Pearson's correlation coefficient, regression lines, and the critical distinction between interpolation and extrapolation. Learn to interpret, calculate, and evaluate statistical relationships for the eAssessment.
What You'll Learn
- Calculate and interpret Pearson's correlation coefficient (r)
- Find and use the equation of the regression line (ŷ = a + bx)
- Distinguish between interpolation (reliable) and extrapolation (unreliable)
- Understand residuals and their role in assessing fit
- Evaluate statistical claims using correlation and regression concepts
- Apply the “correlation does not imply causation” principle in Criterion D contexts
eAssessment Focus
Criterion A: Calculate r, find regression equations, use them for predictions.
Criterion B: Investigate patterns in bivariate data and justify conclusions.
Criterion C: Communicate using correct statistical terminology — always state the type and strength of correlation.
Criterion D: Evaluate real-world claims involving correlation — address causation, reliability, and limitations.
Key Vocabulary
| Term | Definition |
|---|---|
| Pearson's r | A measure of the strength and direction of linear correlation: −1 ≤ r ≤ 1 |
| Regression line | The line of best fit calculated to minimise the sum of squared residuals |
| Equation of regression line | ŷ = a + bx, where b is the gradient and a is the y-intercept |
| Interpolation | Using the regression line within the range of data — generally reliable |
| Extrapolation | Using the regression line outside the range of data — less reliable |
| Residual | The vertical distance from a data point to the regression line (observed − predicted) |
| Bivariate data | Data involving two variables, typically plotted on a scatter diagram |
| Causation | A direct cause-and-effect relationship between variables (correlation alone does not prove this) |
Pearson's Correlation Coefficient (r)
Pearson's r quantifies how closely bivariate data follows a linear pattern. It ranges from −1 (perfect negative) to +1 (perfect positive).
Interpreting r Values
| Value of r | Interpretation |
|---|---|
| r = 1 | Perfect positive linear correlation |
| 0.7 ≤ r < 1 | Strong positive correlation |
| 0.4 ≤ r < 0.7 | Moderate positive correlation |
| 0 < r < 0.4 | Weak positive correlation |
| r = 0 | No linear correlation |
| −0.4 < r < 0 | Weak negative correlation |
| −0.7 < r ≤ −0.4 | Moderate negative correlation |
| −1 < r ≤ −0.7 | Strong negative correlation |
| r = −1 | Perfect negative linear correlation |
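The interpretation table above can be checked directly. A minimal Python sketch of Pearson's r from its definition — the study-hours data here is made up purely for illustration:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired (bivariate) data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6]
score = [15, 20, 22, 28, 31, 35]
print(round(pearson_r(hours, score), 3))  # a strong positive correlation (r close to 1)
```

On a GDC the same value is produced automatically by the linear regression command; computing it by hand once makes the formula easier to interpret.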
The Coefficient of Determination (r²)
Squaring r gives the coefficient of determination, r², which is the proportion of the variation in y explained by the linear relationship with x. For example, r = 0.6 gives r² = 0.36, so 36% of the variation in y is explained by the model; the remaining 64% is due to other factors or random variation.
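A one-line check of the r² arithmetic, using the r = 0.85 value from the worked examples later in this section:

```python
r = 0.85             # correlation coefficient from the study-hours worked example
r_squared = r ** 2   # proportion of variation in y explained by the model
print(round(r_squared, 4))
```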
Regression Lines
The regression line is the mathematically best-fitting straight line through bivariate data, found by minimising the sum of squared residuals (least squares method).
Equation of the Regression Line
ŷ = a + bx, where b is the gradient (slope) and a is the y-intercept.
The gradient b tells you: for every 1-unit increase in x, y changes by b units. The y-intercept a is the predicted value of y when x = 0 (which may or may not have practical meaning).
Example: for the model ŷ = 12.5 + 3.2x (exam score vs. hours of study), the interpretation is: for each additional hour of study, the predicted exam score increases by 3.2 marks. A student studying 0 hours would be predicted to score 12.5 (the baseline).
Finding the Regression Line on a GDC
- Enter x-values in L1 and y-values in L2
- Run linear regression (LinReg or equivalent)
- Read off values of a and b, and write the equation ŷ = a + bx
- Also note the r value to assess the strength of the model
The Mean Point
The regression line always passes through the mean point (x̄, ȳ). This can be used to verify your equation: substitute x = x̄ and check that ŷ = ȳ.
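A quick numeric check of this property, using small made-up data that lies exactly on the line ŷ = 1 + 2x:

```python
x = [2, 4, 6, 8]
y = [5, 9, 13, 17]           # hypothetical data on the line y = 1 + 2x

mean_x = sum(x) / len(x)     # x-bar
mean_y = sum(y) / len(y)     # y-bar

a, b = 1, 2                  # regression line for this data: yhat = 1 + 2x
# Substituting x-bar into the equation should return y-bar exactly:
assert a + b * mean_x == mean_y
print(mean_x, mean_y)
```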
Interpolation & Extrapolation
Understanding when predictions from a regression line are reliable is critical for Criterion D questions.
Interpolation (Reliable)
Predicting ŷ for an x-value that lies within the range of the observed data. The prediction is supported by surrounding data points, so it is generally reliable.
Extrapolation (Unreliable)
Predicting ŷ for an x-value that lies outside the range of the observed data. This assumes the linear trend continues beyond what was observed, so the prediction is unreliable.
Summary Comparison
| Feature | Interpolation | Extrapolation |
|---|---|---|
| Range | Within observed data | Outside observed data |
| Reliability | Generally reliable | Unreliable |
| Justification | Supported by surrounding data points | Assumes trend continues — may not |
| eAssessment | Must state it is interpolation | Must state it is extrapolation + warn about unreliability |
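The distinction in the table reduces to a single range check. A small helper sketch (the function name and data are illustrative, using the observed temperature range from the ice-cream example below):

```python
def classify_prediction(x_new, xs):
    """State whether predicting at x_new is interpolation or extrapolation."""
    if min(xs) <= x_new <= max(xs):
        return "interpolation (generally reliable)"
    return "extrapolation (unreliable - assumes the trend continues)"

temps = [10, 15, 20, 25, 30, 35]          # observed x-values, in °C
print(classify_prediction(25, temps))     # within range -> interpolation
print(classify_prediction(50, temps))     # outside range -> extrapolation
```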
Residuals & Limitations
Residuals measure how well individual data points fit the regression line. Understanding limitations strengthens your Criterion D responses.
What Are Residuals?
Residual = observed value − predicted value (y − ŷ). A positive residual means the actual value is above the regression line; a negative residual means it is below. For a good model, residuals should be randomly scattered with no pattern.
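Computing residuals is a direct application of observed − predicted. A sketch using the line ŷ = 10 + 4x from the worked example later in this section, with made-up observed values:

```python
def residuals(xs, ys, a, b):
    """observed - predicted for each data point, given the line yhat = a + bx."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical observations compared against yhat = 10 + 4x
xs = [1, 3, 5, 7]
ys = [15, 21, 30, 39]
print(residuals(xs, ys, a=10, b=4))   # [1, -1, 0, 1] — small and mixed in sign
```

Residuals that are small and randomly scattered (as here) suggest the linear model fits well; a run of same-sign residuals would suggest a pattern the line is missing.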
Limitations of Correlation and Regression
Correlation ≠ Causation
A strong r does not prove x causes y. Lurking variables may explain both.
Outliers
A single outlier can dramatically change r and the regression line. Always identify and discuss outliers.
Linear Only
Pearson's r only measures linear relationships. A strong curved relationship can have r ≈ 0.
Sample Size
Small samples give unreliable r values. A large sample increases confidence in the result.
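The "linear only" limitation can be seen concretely: perfectly related but symmetric quadratic data gives r = 0, even though y is completely determined by x. A small demonstration (data chosen for symmetry about x = 0):

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]     # y = x², a perfect but non-linear relationship
print(pearson_r(xs, ys))     # 0.0 — Pearson's r misses the curve entirely
```

This is why a scatter diagram should always be drawn before trusting r: the coefficient summarises linear association only.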
Worked Examples
These examples demonstrate the calculation, interpretation, and evaluation expected at Grade 10 eAssessment level.
r² = 0.85² = 0.7225, meaning 72.25% of the variation in exam scores is explained by the linear relationship with study hours.
Limitations: Correlation does not imply causation — other factors (prior knowledge, quality of study, sleep) may drive performance. The model only captures linear relationships.
At 25°C: ŷ = 120 + 15(25) = 120 + 375 = £495. This is interpolation (25°C is within the range 10–35°C), so the prediction is reliable.
At 50°C: ŷ = 120 + 15(50) = 120 + 750 = £870.
This is extrapolation (50°C is far outside the range 10–35°C), so the prediction is unreliable. At such extreme temperatures, people may not go outside to buy ice cream at all, and the linear relationship is unlikely to hold.
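The two predictions in this example can be checked with a couple of lines of Python (the arithmetic only — the reliability judgement still has to be stated in words):

```python
a, b = 120, 15                 # regression line from the example: yhat = 120 + 15x
for temp in (25, 50):
    print(temp, a + b * temp)  # 25 -> 495 (interpolation), 50 -> 870 (extrapolation)
```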
Limitations:
1. Correlation ≠ causation: We cannot conclude that screen time causes lower GPA. Perhaps students with lower GPA have less academic motivation, leading to more leisure screen time (reverse causation).
2. Confounding variables: Socioeconomic background, sleep quality, and time management could all influence both variables.
3. The model is linear only — perhaps moderate screen time has little effect but excessive screen time causes a sharp decline (non-linear relationship).
ŷ = a + bx
30 = a + 4(5)
30 = a + 20
a = 10
Equation: ŷ = 10 + 4x
Verification: When x = 5: ŷ = 10 + 4(5) = 30 = ȳ ✓
The residual is positive, meaning the actual value is 3 units above the regression line. This data point performed better than the model predicted.
Key principle: Sample size matters. A moderate correlation from a large sample is more reliable than a strong correlation from a tiny sample.
This illustrates that the population matters. A correlation found in one group cannot be assumed to apply to another. It also shows that Pearson's r depends on the range of data — restricting the range (adults only) can reduce correlation that exists in the broader population.
Practice Q&A
Attempt each question before revealing the answer. Use correct statistical terminology throughout.
Residual = observed − predicted = 27 − 25 = +2
The observed value is 2 units above the regression line.
This means 36% of the variation in y is explained by the linear relationship with x. The remaining 64% is due to other factors or random variation.
Step 1: Substitute the mean point into ŷ = a + bx: 50 = a + 3.5(12) = a + 42, so a = 8
Step 2: Equation: ŷ = 8 + 3.5x
Step 3: Predict: ŷ = 8 + 3.5(15) = 8 + 52.5 = 60.5
Since r = 0.91 (strong positive), the model fits well. If x = 15 is within the data range, this is interpolation and reliable. If it is outside the range, this would be extrapolation and less reliable.
Flashcard Review
Tap each card to reveal the answer. Try to answer from memory first.