Statistics — Regression and Correlation

Master Pearson's correlation coefficient, regression lines, and the critical distinction between interpolation and extrapolation. Learn to interpret, calculate, and evaluate statistical relationships for the eAssessment.

What You'll Learn

  • Calculate and interpret Pearson's correlation coefficient (r)
  • Find and use the equation of the regression line (ŷ = a + bx)
  • Distinguish between interpolation (reliable) and extrapolation (unreliable)
  • Understand residuals and their role in assessing fit
  • Evaluate statistical claims using correlation and regression concepts
  • Apply the “correlation does not imply causation” principle in Criterion D contexts

eAssessment Focus

Criterion A: Calculate r, find regression equations, use them for predictions.

Criterion B: Investigate patterns in bivariate data and justify conclusions.

Criterion C: Communicate using correct statistical terminology — always state the type and strength of correlation.

Criterion D: Evaluate real-world claims involving correlation — address causation, reliability, and limitations.

Key Vocabulary

Term | Definition
Pearson's r | A measure of the strength and direction of linear correlation: −1 ≤ r ≤ 1
Regression line | The line of best fit calculated to minimise the sum of squared residuals
Equation of regression line | ŷ = a + bx, where b is the gradient and a is the y-intercept
Interpolation | Using the regression line within the range of data; generally reliable
Extrapolation | Using the regression line outside the range of data; less reliable
Residual | The signed vertical distance from the regression line to a data point (observed − predicted)
Bivariate data | Data involving two variables, typically plotted on a scatter diagram
Causation | A direct cause-and-effect relationship between variables (correlation alone does not prove this)

Pearson's Correlation Coefficient (r)

Pearson's r quantifies how closely bivariate data follows a linear pattern. It ranges from −1 (perfect negative) to +1 (perfect positive).

Interpreting r Values

Value of r | Interpretation
r = 1 | Perfect positive linear correlation
0.7 ≤ r < 1 | Strong positive correlation
0.4 ≤ r < 0.7 | Moderate positive correlation
0 < r < 0.4 | Weak positive correlation
r = 0 | No linear correlation
−0.4 < r < 0 | Weak negative correlation
−0.7 < r ≤ −0.4 | Moderate negative correlation
−1 < r ≤ −0.7 | Strong negative correlation
r = −1 | Perfect negative linear correlation

Reading r from a GDC: On the eAssessment calculator section, you will compute r using your GDC. Enter the data in two lists (L1, L2), run a linear regression, and read the r value. Always state both the strength (strong/moderate/weak) and direction (positive/negative).
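
For readers who want to see what the GDC is doing, the following is a minimal Python sketch rather than the calculator's actual routine. It computes r from the standard definition and then labels it using the bands in the table above. The function names `pearson_r` and `describe` and the data values are hypothetical, chosen only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the definition: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def describe(r):
    """State strength and direction using the bands from the table."""
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        return f"perfect {direction} linear correlation"
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} correlation"

# Hypothetical data: hours studied (list L1) and exam score (list L2)
hours = [2, 4, 5, 7, 8, 10]
scores = [20, 24, 30, 33, 40, 44]
r = pearson_r(hours, scores)
print(round(r, 3), "->", describe(r))
```

The sketch raises a division-by-zero error if every x-value (or every y-value) is identical, because r is undefined in that case.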

The Coefficient of Determination (r²)

Key Idea
r² tells you the proportion of variation in y that is explained by the linear relationship with x.
Example: If r = 0.8, then r² = 0.64. This means 64% of the variation in y is explained by its linear relationship with x. The remaining 36% is due to other factors.
Correlation ≠ Causation: A strong correlation does not prove that one variable causes the other. There may be lurking variables, confounding factors, or coincidence. You must state this limitation in any Criterion D response involving correlation.
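
The "proportion of variation explained" reading of r² can be checked numerically. The sketch below uses a small hypothetical data set and illustrative function names. It fits the least-squares line (covered in the next section), then compares 1 − (unexplained variation ÷ total variation) with the square of r.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and gradient b for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

def proportion_explained(xs, ys):
    """1 - (sum of squared residuals) / (total variation of y about its mean)."""
    a, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total

def r_squared(xs, ys):
    """Square of Pearson's r, computed as S_xy^2 / (S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(proportion_explained(xs, ys), 2))  # 0.6
print(round(r_squared(xs, ys), 2))             # 0.6
```

Both calculations give 0.6 for this data set, so 60% of the variation in y is accounted for by the line.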

Regression Lines

The regression line is the mathematically best-fitting straight line through bivariate data, found by minimising the sum of squared residuals (least squares method).

Figure: Scatter plot (axes x and y) with regression line ŷ = a + bx. The line always passes through the mean point (x̄, ȳ). Residuals are vertical distances from points to the line.

Equation of the Regression Line

Formula
ŷ = a + bx
where b = gradient (slope) and a = y-intercept

The gradient b tells you: for every 1-unit increase in x, the predicted value of y changes by b units. The y-intercept a is the predicted value of y when x = 0 (which may or may not have practical meaning).

Example: A regression line is ŷ = 12.5 + 3.2x, where x = hours studied and y = exam score.
Interpretation: For each additional hour of study, the predicted exam score increases by 3.2 marks. A student studying 0 hours would be predicted to score 12.5 (the baseline).
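
The interpretation above can be checked with a few lines of Python. This is only an illustrative sketch, and the function name `predict` is hypothetical. It evaluates the example line at several values of x and confirms that two predictions one hour apart differ by the gradient.

```python
def predict(a, b, x):
    """Predicted value y-hat = a + b*x."""
    return a + b * x

a, b = 12.5, 3.2  # from the example line y-hat = 12.5 + 3.2x

for hours in (0, 5, 6):
    print(f"{hours} hours -> predicted score {predict(a, b, hours):.1f}")

# One extra hour of study changes the prediction by the gradient b
difference = predict(a, b, 6) - predict(a, b, 5)
print(f"difference for one extra hour: {difference:.1f}")
```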

Finding the Regression Line on a GDC

The Mean Point

The regression line always passes through the mean point (x̄, ȳ). This can be used to verify your equation: substitute x = x̄ and check that ŷ = ȳ.
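
As a rough illustration of the least-squares calculation and the mean-point check, the sketch below uses the standard formulas b = S_xy / S_xx and a = ȳ − b·x̄. The data values and the function name `least_squares` are hypothetical, and a GDC may present its output differently.

```python
def least_squares(xs, ys):
    """Return (a, b) for y-hat = a + b*x, minimising the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # gradient
    a = y_bar - b * x_bar    # intercept, chosen so the line passes through the mean point
    return a, b

# Hypothetical data, for illustration only
xs = [2, 4, 6, 8, 10]
ys = [18, 27, 30, 41, 44]
a, b = least_squares(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print(f"y-hat = {a:.1f} + {b:.1f}x")
# Verification: substituting x-bar should return y-bar
passes_through_mean = abs((a + b * x_bar) - y_bar) < 1e-9
print("passes through mean point:", passes_through_mean)
```

For this data set the mean point is (6, 32) and the fitted line is ŷ = 12.2 + 3.3x, which gives 12.2 + 3.3(6) = 32.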

Interpolation & Extrapolation

Understanding when predictions from a regression line are reliable is critical for Criterion D questions.

Interpolation (Reliable)

Definition
Using the regression line to predict values within the range of the original data.
Example: Data was collected for students studying between 2 and 10 hours. Predicting the score for a student who studied 6 hours is interpolation — reliable because the model is supported by surrounding data.

Extrapolation (Unreliable)

Definition
Using the regression line to predict values outside the range of the original data.
Example: Predicting the score for a student who studied 20 hours (when data only goes up to 10 hours) is extrapolation — unreliable because the linear relationship may not continue beyond the observed range. Perhaps returns diminish, or fatigue sets in.
eAssessment Rule: Whenever you use a regression line for prediction, you must state whether the prediction is interpolation or extrapolation, and comment on its reliability. This is a mandatory communication step for full Criterion C marks.

Summary Comparison

Feature | Interpolation | Extrapolation
Range | Within observed data | Outside observed data
Reliability | Generally reliable | Unreliable
Justification | Supported by surrounding data points | Assumes trend continues (it may not)
eAssessment | Must state it is interpolation | Must state it is extrapolation + warn about unreliability
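
The comparison above amounts to a simple range check. The sketch below is one possible way to express it, with hypothetical function names. It assumes, purely for illustration, that the earlier example line ŷ = 12.5 + 3.2x was fitted to the 2 to 10 hour data; the source does not state this.

```python
def prediction_type(x, x_min, x_max):
    """Label a prediction by where x falls relative to the observed x-range."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"

def predict_with_comment(a, b, x, x_min, x_max):
    """Return the prediction together with its type and a reliability comment."""
    y_hat = a + b * x
    kind = prediction_type(x, x_min, x_max)
    if kind == "interpolation":
        comment = "within the data range, so generally reliable"
    else:
        comment = "outside the data range, so unreliable (the trend may not continue)"
    return y_hat, kind, comment

# Assumed pairing of line and data range, for illustration only
a, b = 12.5, 3.2
for hours in (6, 20):
    y_hat, kind, comment = predict_with_comment(a, b, hours, 2, 10)
    print(f"x = {hours}: y-hat = {y_hat:.1f} ({kind}: {comment})")
```

Values exactly at the ends of the range are treated as interpolation here; that boundary choice is a convention of this sketch.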

Residuals & Limitations

Residuals measure how well individual data points fit the regression line. Understanding limitations strengthens your Criterion D responses.

What Are Residuals?

Formula
Residual = Observed value − Predicted value = y − ŷ

A positive residual means the actual value is above the regression line. A negative residual means it is below. For a good model, residuals should be randomly scattered with no pattern.

Example: If a student who studied 5 hours scored 78, and the regression line predicts ŷ = 72 for x = 5, then the residual = 78 − 72 = +6. This student performed 6 marks above the predicted value.
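
The residual formula is easy to apply to a whole data set at once. The following sketch uses hypothetical function names, and the line and data points in the second part are made up for illustration. It reproduces the example above and then lists residuals for several points.

```python
def residual(observed, predicted):
    """Residual = observed value - predicted value (y - y-hat)."""
    return observed - predicted

def residuals(a, b, xs, ys):
    """Residuals of every point relative to the line y-hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# The example above: scored 78, predicted 72
example = residual(78, 72)
print(example)  # 6

# Hypothetical line and data points, for illustration only
line_a, line_b = 10, 4
xs = [2, 5, 8]
ys = [20, 28, 42]
for x, res in zip(xs, residuals(line_a, line_b, xs, ys)):
    position = "above" if res > 0 else "below" if res < 0 else "on"
    print(f"x = {x}: residual {res:+d} ({position} the line)")
```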

Limitations of Correlation and Regression

Correlation ≠ Causation

A strong r does not prove x causes y. Lurking variables may explain both.

Outliers

A single outlier can dramatically change r and the regression line. Always identify and discuss outliers.

Linear Only

Pearson's r only measures linear relationships. A strong curved relationship can have r ≈ 0.

Sample Size

Small samples give unreliable r values. A large sample increases confidence in the result.
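
Two of these limitations can be demonstrated with small made-up data sets. The sketch below is illustrative only, and the data and function name are hypothetical. It shows a perfect U-shaped relationship with r = 0, and a perfectly linear data set whose r falls sharply when a single outlier is added.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the definition: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Linear only: y = x^2 is a perfect (curved) relationship, yet r is 0
u_xs = [-3, -2, -1, 0, 1, 2, 3]
u_ys = [x * x for x in u_xs]
r_u_shape = pearson_r(u_xs, u_ys)
print(f"U-shaped data: r = {r_u_shape:.2f}")

# Outliers: six points exactly on y = 2x, then one outlier at (7, 0)
line_xs = [1, 2, 3, 4, 5, 6]
line_ys = [2, 4, 6, 8, 10, 12]
r_clean = pearson_r(line_xs, line_ys)
r_outlier = pearson_r(line_xs + [7], line_ys + [0])
print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

In this made-up case one extra point moves r from 1 to 0.25, which is why outliers should always be identified and discussed.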

Exam Strategy: In eAssessment Criterion D questions, always discuss at least two limitations of the statistical model. Mentioning “correlation does not imply causation” is essential, but also comment on sample size, potential outliers, or non-linearity.

Worked Examples

These examples demonstrate the calculation, interpretation, and evaluation expected at Grade 10 eAssessment level.

EXAMPLE 1: A study finds r = 0.85 between hours of study and exam score. Interpret this value.
Full Solution
r = 0.85 indicates a strong positive linear correlation between hours of study and exam score. As study hours increase, exam scores tend to increase in a fairly consistent linear pattern.

r² = 0.85² = 0.7225, meaning approximately 72.3% of the variation in exam scores is explained by the linear relationship with study hours.

Limitations: Correlation does not imply causation — other factors (prior knowledge, quality of study, sleep) may drive performance. The model only captures linear relationships.

EXAMPLE 2: The regression line for data where x = temperature (°C) and y = ice cream sales (£) is ŷ = 120 + 15x. Data was collected for temperatures between 10°C and 35°C. Predict sales at 25°C and 50°C.
Full Solution
At 25°C: ŷ = 120 + 15(25) = 120 + 375 = £495.
This is interpolation (25°C is within the range 10–35°C), so the prediction is reliable.

At 50°C: ŷ = 120 + 15(50) = 120 + 750 = £870.
This is extrapolation (50°C is far outside the range 10–35°C), so the prediction is unreliable. At such extreme temperatures, people may not go outside to buy ice cream at all, and the linear relationship is unlikely to hold.

EXAMPLE 3: A student calculates r = −0.72 between hours of screen time per day and GPA. Interpret the result and discuss limitations.
Full Solution
r = −0.72 indicates a strong negative linear correlation — as daily screen time increases, GPA tends to decrease.

Limitations:
1. Correlation ≠ causation: We cannot conclude that screen time causes lower GPA. Perhaps students with lower GPA have less academic motivation, leading to more leisure screen time (reverse causation).
2. Confounding variables: Socioeconomic background, sleep quality, and time management could all influence both variables.
3. The model is linear only — perhaps moderate screen time has little effect but excessive screen time causes a sharp decline (non-linear relationship).

EXAMPLE 4: Given the data: x̄ = 5, ȳ = 30, b = 4. Find the equation of the regression line.
Full Solution
The regression line passes through the mean point (x̄, ȳ).
ŷ = a + bx
30 = a + 4(5)
30 = a + 20
a = 10

Equation: ŷ = 10 + 4x

Verification: When x = 5: ŷ = 10 + 4(5) = 30 = ȳ ✓
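
The rearrangement used in this example, a = ȳ − b·x̄, can be written as a one-line helper. The sketch below is illustrative only and the function name is hypothetical. It reproduces the calculation above and the verification step.

```python
def intercept_from_mean_point(x_bar, y_bar, b):
    """Solve y_bar = a + b * x_bar for the intercept a."""
    return y_bar - b * x_bar

x_bar, y_bar, b = 5, 30, 4
a = intercept_from_mean_point(x_bar, y_bar, b)
print(f"y-hat = {a} + {b}x")  # y-hat = 10 + 4x

# Verification: the line should return y-bar when x = x-bar
check = a + b * x_bar
print(check == y_bar)  # True
```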

EXAMPLE 5: The observed value is y = 45 and the predicted value is ŷ = 42. Calculate and interpret the residual.
Full Solution
Residual = observed − predicted = y − ŷ = 45 − 42 = +3.

The residual is positive, meaning the actual value is 3 units above the regression line. The model under-predicted this data point by 3 units.

EXAMPLE 6: Two studies report: Study A: r = 0.92 (n = 200). Study B: r = 0.95 (n = 8). Which provides stronger evidence of correlation?
Full Solution
Study A provides stronger evidence, despite having a slightly lower r value. With n = 200, the correlation is very unlikely to be due to chance, and no single data point can change r much. Study B's r = 0.95 rests on only 8 data points, so it is far more sensitive to random variation: adding or removing one or two points could change it substantially.

Key principle: Sample size matters. A slightly lower r from a large sample is stronger evidence than a higher r from a tiny sample.

EXAMPLE 7: Explain why r = 0.02 between age and shoe size for adults does not mean there is no relationship between age and shoe size for children.
Full Solution
For adults, shoe size is essentially fixed regardless of age, so r ≈ 0 is expected. However, for children, feet grow as they age, producing a strong positive correlation.

This illustrates that the population matters. A correlation found in one group cannot be assumed to apply to another. It also shows that Pearson's r depends on the range of data — restricting the range (adults only) can reduce correlation that exists in the broader population.

Practice Q&A

Attempt each question before revealing the answer. Use correct statistical terminology throughout.

INTERPRET: A researcher finds r = 0.45 between daily coffee consumption and productivity rating. Interpret this finding.
Model Answer
r = 0.45 indicates a moderate positive linear correlation. As daily coffee consumption increases, productivity tends to increase, but the points are fairly widely scattered around the trend. However, correlation does not imply causation: other factors (sleep, motivation) likely contribute. The relationship may also be non-linear (e.g., excessive coffee could reduce productivity).

CALCULATE: The regression line is ŷ = 5 + 2.5x. Find the predicted value when x = 8 and the residual if the observed value is 27.
Model Answer
ŷ = 5 + 2.5(8) = 5 + 20 = 25
Residual = observed − predicted = 27 − 25 = +2
The observed value is 2 units above the regression line.

EVALUATE: A newspaper headline states: “Eating chocolate makes you smarter — countries with higher chocolate consumption win more Nobel Prizes.” Evaluate this claim.
Model Answer
This is a classic example of correlation being mistaken for causation. While there may be a positive correlation between chocolate consumption and Nobel Prizes per capita, this does not mean chocolate causes Nobel Prizes. Both are likely driven by a confounding variable: national wealth. Wealthier countries can afford more chocolate AND invest more in education and research. Additionally, the sample size (number of countries) is limited, and Nobel Prizes may not meaningfully measure a nation's intelligence.

EXPLAIN: Why does the regression line always pass through (x̄, ȳ)?
Model Answer
The regression line minimises the sum of squared residuals. Mathematically, the sum of all residuals equals zero, which is only possible if the line passes through the point (x̄, ȳ). This is a property of the least-squares method and can be used as a verification tool: substitute x̄ into your equation and check that ŷ = ȳ.

CALCULATE: If r = −0.6, calculate r² and interpret it.
Model Answer
r² = (−0.6)² = 0.36
This means 36% of the variation in y is explained by the linear relationship with x. The remaining 64% is due to other factors or random variation.

JUSTIFY: Data was collected for ages 15–65. A regression line is used to predict the value for age 80. Is this appropriate?
Model Answer
No. Predicting for age 80 is extrapolation — the value is outside the range of the original data (15–65). The linear relationship observed between 15 and 65 may not continue beyond this range. Biological, social, or economic factors could change the pattern for older ages. The prediction is unreliable and should be treated with caution.

MULTI-STEP: Given: x̄ = 12, ȳ = 50, r = 0.91, b = 3.5. Find a, write the equation, and predict y when x = 15.
Model Answer
Step 1: Find a using the mean point: ȳ = a + bx̄
50 = a + 3.5(12) = a + 42
a = 8

Step 2: Equation: ŷ = 8 + 3.5x

Step 3: Predict: ŷ = 8 + 3.5(15) = 8 + 52.5 = 60.5

Since r = 0.91 (strong positive), the model fits well. If x = 15 is within the data range, this is interpolation and reliable. If it is outside the range, this would be extrapolation and less reliable.

EXPLAIN: A scatter plot shows a clear U-shaped pattern but Pearson's r = 0.05. Explain why.
Model Answer
Pearson's r only measures linear correlation. A U-shaped (quadratic) pattern has a clear relationship, but it is not linear — as x increases, y first decreases and then increases (or vice versa). The positive and negative parts cancel out, giving r ≈ 0. This shows that r ≈ 0 does not mean “no relationship” — it means no linear relationship. Always check the scatter plot visually.

Flashcard Review

Try to answer each question from memory before reading the answer.

Q: What does Pearson's r measure?
A: The strength and direction of the linear correlation between two variables. Range: −1 ≤ r ≤ 1.

Q: What does r = −0.9 indicate?
A: A strong negative linear correlation: as one variable increases, the other decreases in a nearly linear pattern.

Q: What is the equation of a regression line?
A: ŷ = a + bx, where b is the gradient (slope) and a is the y-intercept.

Q: What is interpolation?
A: Using the regression line to predict values within the range of the original data; generally reliable.

Q: What is extrapolation?
A: Using the regression line to predict values outside the range of the original data; unreliable because the trend may not continue.

Q: How do you calculate a residual?
A: Residual = observed value − predicted value = y − ŷ. Positive = above the line; negative = below.

Q: What does r² = 0.81 mean?
A: 81% of the variation in y is explained by the linear relationship with x. 19% is due to other factors.

Q: Does correlation imply causation?
A: No. A strong correlation does not prove causation. There may be confounding variables, reverse causation, or coincidence.

Q: What point does the regression line always pass through?
A: The mean point (x̄, ȳ). This is a property of the least-squares method.

Q: What does r = 0 mean?
A: No linear correlation. But there may still be a non-linear relationship, so always check the scatter plot.

Q: What is a confounding variable?
A: A hidden third variable that influences both x and y, creating a misleading correlation between them.

Q: How does an outlier affect the regression line?
A: A single outlier can dramatically change both the gradient of the regression line and the value of r, distorting the model.

Q: What is a strong positive correlation range?
A: 0.7 ≤ r < 1. Both the strength (strong) and direction (positive) must be stated.

Q: Why does sample size matter for r?
A: Small samples give unreliable r values, because a high r could occur by chance. Large samples give more confidence that the correlation is genuine.

Q: What does the gradient b represent in ŷ = a + bx?
A: For every 1-unit increase in x, y is predicted to change by b units. b > 0 means positive slope; b < 0 means negative slope.
