Data Analysis

Collect, organise, and analyse data using statistical measures and graphs. Explore relationships between variables using scatter plots and correlation.

What You'll Learn

  • Calculate mean, median, mode, and range from raw data and frequency tables
  • Find quartiles, IQR, and identify outliers
  • Construct and interpret box-and-whisker plots
  • Read and create histograms with equal and unequal class widths
  • Draw and interpret scatter plots with lines of best fit
  • Describe correlation and understand why correlation ≠ causation

IB Assessment Focus

Criterion A: Select and apply appropriate statistical methods to unfamiliar datasets.

Criterion C: Interpret results using correct statistical language.

Criterion D: Use statistics to draw conclusions in real-world contexts and discuss limitations.

Key Vocabulary

Term | Definition
Population | The entire group being studied
Sample | A subset of the population used to represent the whole
Discrete data | Counted data; specific values (e.g., number of siblings)
Continuous data | Measured data; can take any value in a range (e.g., height, time)
Frequency | How often a value or class occurs
Cumulative frequency | Running total of frequencies up to and including each class

Measures of Central Tendency

These "averages" describe the centre of a dataset. Each has strengths and weaknesses.

Measure | How to Calculate | Best For | Weakness
Mean | Sum of all values ÷ number of values | Symmetrical data with no outliers | Affected by extreme values (outliers)
Median | Middle value when data is ordered | Skewed data; when outliers are present | Ignores most of the actual values
Mode | Most frequently occurring value | Categorical data; most popular item | May not exist or may not be unique

Calculating the Mean

Mean
Mean = Σx / n   (sum of values ÷ number of values)
Worked Example — Calculating the Mean
Data: 4, 7, 8, 10, 11   (n = 5)
Sum = 4 + 7 + 8 + 10 + 11 = 40   (add all values)
Mean = 40 / 5   (divide by n)
Mean = 8
Mean from a frequency table:
Score (x) | Frequency (f) | f × x
3 | 2 | 6
4 | 5 | 20
5 | 3 | 15

Mean = Σ(f × x) / Σf = (6 + 20 + 15) / (2 + 5 + 3) = 41 / 10 = 4.1
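Both mean calculations above can be checked with a short Python sketch. The function names `mean_raw` and `mean_from_table` are made up for this illustration; the numbers are the ones used in the two examples above.

```python
def mean_raw(values):
    """Mean of raw data: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def mean_from_table(table):
    """Mean from a frequency table given as (value, frequency) pairs."""
    total_fx = sum(x * f for x, f in table)   # sum of f × x
    total_f = sum(f for _, f in table)        # sum of f
    return total_fx / total_f


print(mean_raw([4, 7, 8, 10, 11]))                  # 8.0
print(mean_from_table([(3, 2), (4, 5), (5, 3)]))    # 4.1
```

Note that the frequency-table version divides by the total frequency (10), not by the number of rows in the table (3).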

Finding the Median

  1. Order the data from smallest to largest.
  2. If n is odd: median is the middle value (position: (n+1)/2).
  3. If n is even: median is the mean of the two middle values.
Example (odd n): Data: 3, 5, 7, 9, 12. n = 5. Median position: (5+1)/2 = 3rd value = 7.
Example (even n): Data: 2, 4, 6, 8. n = 4. Middle values: 4 and 6. Median = (4+6)/2 = 5.
Common Mistake: Forgetting to order the data before finding the median. The median is the middle of the ordered data, not the middle of the list as given!
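The steps above can be sketched in Python as follows. This is a minimal illustration (the function name `median` is our own choice), written so that the ordering step cannot be forgotten.

```python
def median(values):
    """Order the data, then take the middle value (odd n)
    or the mean of the two middle values (even n)."""
    ordered = sorted(values)      # step 1: always order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # the ((n + 1) / 2)-th value, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 5, 7, 9, 12]))   # 7
print(median([2, 4, 6, 8]))       # 5.0
print(median([9, 3, 12, 5, 7]))   # 7, even though the list was given unordered
```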

Measures of Spread

Spread measures tell you how spread out or clustered the data is.

Measure | Formula | Meaning
Range | Max − Min | Total spread; affected by outliers
IQR | Q3 − Q1 | Spread of middle 50%; resistant to outliers

Quartiles

Quartiles divide ordered data into four equal parts:

  • Q1 (Lower Quartile): Median of the lower half of data (25th percentile)
  • Q2 (Median): Middle value of the entire dataset (50th percentile)
  • Q3 (Upper Quartile): Median of the upper half of data (75th percentile)
Example: Data: 2, 3, 5, 7, 8, 10, 12, 14, 15

n = 9. Median (Q2) = 8 (5th value).
Lower half: 2, 3, 5, 7. Q1 = (3+5)/2 = 4
Upper half: 10, 12, 14, 15. Q3 = (12+14)/2 = 13
IQR = Q3 − Q1 = 13 − 4 = 9
Range = 15 − 2 = 13
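The quartile example above can be reproduced with a small Python sketch. It assumes the "median of each half" method used in this section, where the median is left out of both halves when n is odd. Calculators and software may use slightly different quartile conventions, so their answers can differ. The function names are illustrative only.

```python
def median(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # values below the median position
    upper = ordered[(n + 1) // 2 :]    # values above the median position
    return median(lower), median(ordered), median(upper)


data = [2, 3, 5, 7, 8, 10, 12, 14, 15]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)               # 4.0 8 13.0
print(q3 - q1)                  # 9.0  (IQR)
print(max(data) - min(data))    # 13   (range)
```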

Five-Number Summary

Every box plot is based on five key values:

Value | Description
Min | Smallest value (excluding outliers)
Q1 | Lower quartile (25th percentile)
Q2 | Median (50th percentile)
Q3 | Upper quartile (75th percentile)
Max | Largest value (excluding outliers)

Identifying Outliers

Outlier Rule
A value x is an outlier if   x < Q1 − 1.5 × IQR   or   x > Q3 + 1.5 × IQR
Example: Q1 = 4, Q3 = 13, IQR = 9

Lower fence: 4 − 1.5(9) = 4 − 13.5 = −9.5
Upper fence: 13 + 1.5(9) = 13 + 13.5 = 26.5
Any value below −9.5 or above 26.5 is an outlier.
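As a quick check of the fence arithmetic, here is a minimal Python sketch. The helper names `fences` and `is_outlier` are invented for this example, and the test value 30 is hypothetical (it is not part of the dataset above).

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 × IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, q1, q3):
    """True if x lies below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return x < lower or x > upper


print(fences(4, 13))            # (-9.5, 26.5)
print(is_outlier(15, 4, 13))    # False: 15 is inside the fences
print(is_outlier(30, 4, 13))    # True: a hypothetical value of 30 is above 26.5
```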

Statistical Graphs

Different types of data require different graphical representations.

Box-and-Whisker Plots

A box plot displays the five-number summary visually. The "box" spans Q1 to Q3, with a line at the median. "Whiskers" extend to the min and max; if there are outliers, each whisker stops at the most extreme value that is not an outlier, and the outliers are plotted separately.

How to draw a box plot:
  1. Draw a number line covering the data range.
  2. Mark Q1, median, and Q3 and draw the box.
  3. Draw whiskers from the box to the min and max (the smallest and largest values that are not outliers).
  4. Mark any outliers as individual dots beyond the whiskers.
Interpreting box plots:
  • A wider box means greater IQR — more spread in the middle 50%.
  • If the median is closer to Q1, data is positively skewed (tail to the right).
  • If the median is closer to Q3, data is negatively skewed (tail to the left).
  • Use box plots to compare two datasets: compare medians, IQRs, and ranges.

Histograms

A histogram represents continuous data. Unlike bar charts, bars are adjacent (no gaps) and the area of each bar represents frequency.

Key rules:
  • x-axis: continuous scale (not categories)
  • y-axis: frequency or frequency density
  • Equal class widths: height = frequency
  • Unequal class widths: height = frequency density = frequency ÷ class width
Common Mistake: Don't confuse histograms with bar charts. Bar charts have gaps between bars and are for categorical/discrete data. Histograms have no gaps and are for continuous data.
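The frequency density rule can be illustrated with a short Python sketch. The class boundaries and frequencies below are hypothetical values chosen only to show unequal widths, and `frequency_density` is an illustrative name.

```python
def frequency_density(frequency, lower, upper):
    """Bar height for a histogram class: frequency ÷ class width."""
    return frequency / (upper - lower)


# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 80, 8)]

for lower, upper, freq in classes:
    width = upper - lower
    height = frequency_density(freq, lower, upper)
    # height × width gives back the frequency, which is why area represents frequency
    print(f"{lower}-{upper}: width {width}, frequency {freq}, density {height}")
```

In this made-up table the 20-40 class has the largest frequency (16) but not the tallest bar, because its frequency is spread over a wider class.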

Cumulative Frequency Graphs

  1. Calculate running totals (cumulative frequency).
  2. Plot points at the upper class boundary against cumulative frequency.
  3. Join with a smooth S-curve (ogive).
  4. Read off the median (at n/2), Q1 (at n/4), Q3 (at 3n/4).
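Steps 1, 2, and 4 can be sketched in Python using `itertools.accumulate` for the running totals. The class boundaries and frequencies are illustrative values, and drawing the curve itself is left out.

```python
from itertools import accumulate

# Illustrative grouped data: upper class boundaries and their frequencies
upper_boundaries = [20, 30, 40, 50, 60]
frequencies = [4, 7, 5, 3, 1]

# Step 1: running totals
cumulative = list(accumulate(frequencies))
print(cumulative)                    # [4, 11, 16, 19, 20]

# Step 2: points to plot (upper class boundary, cumulative frequency)
points = list(zip(upper_boundaries, cumulative))
print(points)

# Step 4: cumulative frequency positions to read across from for Q1, median, Q3
n = cumulative[-1]
print(n / 4, n / 2, 3 * n / 4)       # 5.0 10.0 15.0
```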

Correlation & Scatter Plots

Scatter plots show the relationship between two numerical variables. Correlation describes the strength and direction of that relationship.

[Figure: scatter plot showing strong positive correlation. An amber dashed line of best fit passes through the data trend; as x increases, y increases consistently.]

Types of Correlation

Correlation | Pattern | Real-World Example
Strong positive | Points close to an upward line; as x ↑, y ↑ | Height vs. arm span
Weak positive | General upward trend but scattered | Hours of sunshine vs. ice cream sales
Strong negative | Points close to a downward line; as x ↑, y ↓ | Temperature vs. heating bill
Weak negative | General downward trend but scattered | Age of car vs. sale price
No correlation | No visible pattern | Shoe size vs. exam score

Line of Best Fit

  • A straight line drawn through the data that best represents the trend.
  • Roughly equal numbers of points should be above and below the line.
  • The line should pass through or near the mean point (x̄, ȳ).
  • You can use the line to interpolate (estimate within the data range) or extrapolate (estimate beyond the data range — less reliable).
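In this section the line of best fit is drawn by eye. One common way to make "best fit" precise is the least-squares line, which is a different technique from drawing by eye and is shown here only as a sketch. The data is hypothetical and the function name is our own. A useful property is that the least-squares line always passes through the mean point.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar    # makes the line pass through (x̄, ȳ)
    return slope, intercept


# Hypothetical data: hours studied and test score (%)
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

slope, intercept = least_squares_line(hours, scores)
print(round(slope, 2), round(intercept, 2))    # 5.6 46.2

# Check: the line's value at the mean of x is the mean of y
x_bar = sum(hours) / len(hours)
print(round(slope * x_bar + intercept, 2))     # 63.0
```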
Critical Rule: Correlation ≠ Causation. Just because two variables are correlated does NOT mean one causes the other. There may be a confounding variable (a third factor causing both), or the correlation may be coincidental. Example: ice cream sales and drowning rates are positively correlated — but the cause is hot weather, not ice cream.

Interpolation vs Extrapolation

Interpolation: Estimating within the range of data. More reliable because the trend has been observed in this region.
Extrapolation: Estimating outside the range of data. Less reliable because we don't know if the trend continues beyond our data.
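The distinction can be made concrete with a small Python sketch that labels each prediction according to whether x lies inside the observed range. The line y = 5x + 45 and the observed range 1 to 8 are hypothetical values chosen for illustration, as is the function name `predict`.

```python
def predict(x, slope, intercept, x_min, x_max):
    """Return (estimate, kind), where kind says whether x is inside the data range."""
    estimate = slope * x + intercept
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return estimate, kind


# Hypothetical line of best fit y = 5x + 45, fitted to data with x from 1 to 8
print(predict(3, 5, 45, 1, 8))     # (60, 'interpolation')
print(predict(20, 5, 45, 1, 8))    # (145, 'extrapolation')
```

If y were a percentage score, the extrapolated value of 145 would be impossible, which shows how a trend can break down outside the observed range.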

Worked Examples

Full solutions with reasoning and interpretation.

EXAMPLE 1: For the data 12, 15, 18, 20, 22, 25, 28, 30, 45, find the mean and the median, and discuss which is more appropriate.
Full Solution
Mean: (12+15+18+20+22+25+28+30+45)/9 = 215/9 ≈ 23.9

Median: Data is already ordered. n = 9, position = (9+1)/2 = 5th value = 22

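Discussion: The value 45 is much larger than the rest and pulls the mean upward. The median (22) is more appropriate because it is not affected by this extreme value and better represents the "typical" value. (By the 1.5 × IQR rule, 45 is not formally an outlier, since Q1 = 16.5 and Q3 = 29 give an upper fence of 47.75, but it still distorts the mean.)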
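To see how much the single value 45 matters, the sketch below recomputes both measures with and without it, using Python's standard `statistics` module. Dropping 45 is done here only to illustrate sensitivity; it is not a recommendation to delete data.

```python
from statistics import mean, median

data = [12, 15, 18, 20, 22, 25, 28, 30, 45]
without_45 = [x for x in data if x != 45]

print(round(mean(data), 1), median(data))                # 23.9 22
print(round(mean(without_45), 2), median(without_45))    # 21.25 21.0
```

Removing one value shifts the mean by about 2.6 but the median by only 1, which is why the median is described as resistant to extreme values.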
EXAMPLE 2: From a frequency table: Class 10–20 (f=4), 20–30 (f=7), 30–40 (f=5), 40–50 (f=3), 50–60 (f=1). Estimate the mean and find the modal class.
Full Solution
Midpoints: 15, 25, 35, 45, 55
Σ(f × midpoint): 4(15) + 7(25) + 5(35) + 3(45) + 1(55) = 60 + 175 + 175 + 135 + 55 = 600
Σf: 4 + 7 + 5 + 3 + 1 = 20
Estimated mean: 600/20 = 30

Modal class: The class with the highest frequency is 20–30 (f = 7).
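The grouped-data estimate above can be checked with a short Python sketch. Each class is stored as a (lower, upper, frequency) tuple, a layout chosen just for this example.

```python
# Classes from the example: (lower boundary, upper boundary, frequency)
classes = [(10, 20, 4), (20, 30, 7), (30, 40, 5), (40, 50, 3), (50, 60, 1)]

# Use each class midpoint as a stand-in for the unknown individual values
total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
total_f = sum(f for _, _, f in classes)
print(total_fx, total_f, total_fx / total_f)    # 600.0 20 30.0

# Modal class: highest frequency (a fair comparison here because all widths are equal)
modal = max(classes, key=lambda c: c[2])
print(modal[:2])                                # (20, 30)
```

The result is only an estimate because the midpoints replace the actual values inside each class.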
EXAMPLE 3: For the data 3, 5, 7, 8, 12, 14, 16, 19, 21, 24, find Q1, Q2, Q3, and the IQR, and identify any outliers.
Full Solution
n = 10 (even). Q2 (median): mean of 5th and 6th values = (12+14)/2 = 13
Lower half: 3, 5, 7, 8, 12. Q1 = 7 (3rd value)
Upper half: 14, 16, 19, 21, 24. Q3 = 19 (3rd value)
IQR: 19 − 7 = 12
Outlier test: Lower fence: 7 − 1.5(12) = −11. Upper fence: 19 + 1.5(12) = 37.
All values fall within [−11, 37], so there are no outliers.
EXAMPLE 4: A scatter plot shows study hours vs test scores with a strong positive correlation. A student studies 5 hours and scores 75%. Using interpolation, estimate the score for 3 hours. Discuss the reliability of predicting the score for 20 hours.
Full Solution
Interpolation (3 hours): If the line of best fit gives approximately 60% for 3 hours, this is a reasonably reliable estimate because 3 hours falls within the data range.

Extrapolation (20 hours): This would be unreliable because 20 hours is far beyond the data collected. The positive correlation may not continue — there is likely a ceiling effect (scores cannot exceed 100%), and studying beyond a certain point may produce diminishing returns. Extrapolation assumes the trend continues, which we cannot verify.
EXAMPLE 5: Compare two datasets using their box plots: Class A (min=45, Q1=55, median=65, Q3=72, max=88) and Class B (min=30, Q1=48, median=62, Q3=80, max=95).
Full Solution
Centre: Class A (median 65) has a slightly higher median than Class B (median 62), suggesting Class A performed slightly better on average.

Spread: Class A IQR = 72 − 55 = 17. Class B IQR = 80 − 48 = 32. Class B has a much larger IQR, meaning the middle 50% of Class B's scores are more spread out — more variation in performance.

Range: Class A: 88 − 45 = 43. Class B: 95 − 30 = 65. Class B has a wider overall range.

Conclusion: Class A performed more consistently (smaller IQR and range) with a slightly higher median. Class B had more variable performance.
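The spread figures in this comparison follow directly from the two five-number summaries, as the sketch below shows. The function name `spread` is illustrative.

```python
def spread(summary):
    """Return (IQR, range) from a five-number summary (min, Q1, median, Q3, max)."""
    lo, q1, _median, q3, hi = summary
    return q3 - q1, hi - lo


class_a = (45, 55, 65, 72, 88)
class_b = (30, 48, 62, 80, 95)

print(spread(class_a))    # (17, 43)
print(spread(class_b))    # (32, 65)
```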

Practice Q&A

Attempt each question before revealing the answer.

CALCULATE: For the data 6, 8, 3, 12, 5, 8, 9, 8, 11, 10, find the mean, median, and mode.
Model Answer
Ordered: 3, 5, 6, 8, 8, 8, 9, 10, 11, 12.
Mean: (3+5+6+8+8+8+9+10+11+12)/10 = 80/10 = 8
Median: n = 10 (even). Middle values: 8 and 8. Median = 8
Mode: 8 (it appears 3 times, more often than any other value)
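One way to check an answer like this is Python's standard `statistics` module. `multimode` returns a list, which also covers datasets with more than one mode.

```python
from statistics import mean, median, multimode

data = [6, 8, 3, 12, 5, 8, 9, 8, 11, 10]

print(sorted(data))       # [3, 5, 6, 8, 8, 8, 9, 10, 11, 12]
print(mean(data))         # 8
print(median(data))       # 8.0
print(multimode(data))    # [8]
```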
ANALYSE: A dataset has mean = 50, but median = 35. What does this suggest about the distribution?
Model Answer
The mean is significantly higher than the median, which suggests the data is positively skewed (skewed to the right). There are likely one or more high outliers pulling the mean upward. In this case, the median is more representative of the typical value.
DESCRIBE: A scatter plot of temperature vs hot chocolate sales shows a strong negative correlation. Describe and explain this relationship.
Model Answer
As temperature increases, hot chocolate sales decrease (and vice versa). This makes sense because people buy more hot drinks in cold weather. The strong negative correlation means the relationship is consistent. However, correlation does not prove causation — other factors (season, holidays) might also influence sales.
JUSTIFY: Explain why extrapolating from a scatter plot of student age (11–18) vs height to predict the height of a 30-year-old would be unreliable.
Model Answer
Extrapolation is unreliable because the trend observed in the data (ages 11–18) may not continue beyond the data range. Growth rate typically slows and stops in the late teens. Predicting height at 30 would assume linear growth continues, which contradicts biological reality.
COMPARE: Dataset A has IQR = 5 and Dataset B has IQR = 20. What does this tell you?
Model Answer
Dataset A has a much smaller IQR, meaning the middle 50% of its data is more concentrated (less spread). Dataset B's data is more variable — there is a wider range of values in the central half. Dataset A is more consistent.
EXPLAIN: Why is the IQR sometimes preferred over the range as a measure of spread?
Model Answer
The IQR is preferred because it is resistant to outliers. The range uses only the maximum and minimum values, which can be extreme and unrepresentative. The IQR measures the spread of the central 50% of data, giving a more robust measure of typical spread.

Flashcard Review


How do you calculate the mean?
Sum of all values ÷ number of values.
Mean = Σx / n
How do you find the median?
Order the data. If n is odd, take the middle value. If n is even, take the mean of the two middle values.
What is the mode?
The most frequently occurring value. A dataset can have no mode, one mode, or multiple modes.
What is the IQR?
Interquartile Range = Q3 − Q1. It measures the spread of the middle 50% of data.
How do you test for outliers?
Outlier if value < Q1 − 1.5×IQR or value > Q3 + 1.5×IQR
Five-number summary?
Min, Q1, Median (Q2), Q3, Max. Used to construct box-and-whisker plots.
Histogram vs bar chart?
Histogram: continuous data, no gaps, area = frequency. Bar chart: categorical data, gaps between bars.
What is correlation?
The strength and direction of the relationship between two variables. Can be positive, negative, or none.
Correlation ≠ causation. Why?
Correlation shows association, not cause. A third confounding variable may cause both, or the correlation may be coincidental.
What is interpolation?
Estimating within the range of observed data. More reliable than extrapolation.
What is extrapolation?
Estimating beyond the range of observed data. Less reliable because the trend may not continue.
When is the median better than the mean?
When data is skewed or has outliers. The median is resistant to extreme values.
Frequency density = ?
Frequency ÷ class width. Used in histograms with unequal class widths.
What does a positively skewed box plot look like?
Median closer to Q1. Longer whisker on the right. The "tail" extends to the right (higher values).
What is cumulative frequency?
A running total of frequencies. Used to draw cumulative frequency graphs (ogives) and read off quartiles.
