Data Analysis
Collect, organise, and analyse data using statistical measures and graphs. Explore relationships between variables using scatter plots and correlation.
What You'll Learn
- Calculate mean, median, mode, and range from raw data and frequency tables
- Find quartiles, IQR, and identify outliers
- Construct and interpret box-and-whisker plots
- Read and create histograms with equal and unequal class widths
- Draw and interpret scatter plots with lines of best fit
- Describe correlation and understand why correlation ≠ causation
IB Assessment Focus
Criterion A: Select and apply appropriate statistical methods to unfamiliar datasets.
Criterion C: Interpret results using correct statistical language.
Criterion D: Use statistics to draw conclusions in real-world contexts and discuss limitations.
Key Vocabulary
| Term | Definition |
|---|---|
| Population | The entire group being studied |
| Sample | A subset of the population used to represent the whole |
| Discrete data | Counted data; specific values (e.g., number of siblings) |
| Continuous data | Measured data; can take any value in a range (e.g., height, time) |
| Frequency | How often a value or class occurs |
| Cumulative frequency | Running total of frequencies up to and including each class |
Measures of Central Tendency
These "averages" describe the centre of a dataset. Each has strengths and weaknesses.
| Measure | How to Calculate | Best For | Weakness |
|---|---|---|---|
| Mean | Sum of all values ÷ number of values | Symmetrical data with no outliers | Affected by extreme values (outliers) |
| Median | Middle value when data is ordered | Skewed data; when outliers are present | Ignores most of the actual values |
| Mode | Most frequently occurring value | Categorical data; most popular item | May not exist or may not be unique |
Calculating the Mean
| Score (x) | Frequency (f) | f × x |
|---|---|---|
| 3 | 2 | 6 |
| 4 | 5 | 20 |
| 5 | 3 | 15 |
Mean = Σ(f × x) / Σf = (6 + 20 + 15) / (2 + 5 + 3) = 41 / 10 = 4.1
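The same calculation can be sketched in Python. This is a minimal illustration, assuming the frequency table is stored as a dictionary of score to frequency; the names `freq_table` and `mean_from_frequency_table` are made up for this example.

```python
# Frequency table from the example above: score -> frequency.
freq_table = {3: 2, 4: 5, 5: 3}


def mean_from_frequency_table(table):
    """Return sum(f * x) / sum(f) for a {value: frequency} mapping."""
    total = sum(x * f for x, f in table.items())  # sum of f * x
    count = sum(table.values())                   # sum of f
    return total / count


print(mean_from_frequency_table(freq_table))  # 4.1
```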
Finding the Median
- Order the data from smallest to largest.
- If n is odd: median is the middle value (position: (n+1)/2).
- If n is even: median is the mean of the two middle values (positions n/2 and n/2 + 1).
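The steps above translate directly into a short function. This is a sketch only; the function name `median` and the two small datasets are hypothetical and chosen for illustration.

```python
def median(values):
    """Median by the rule above: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 in 1-based counting is index mid.
        return ordered[mid]
    # Even n: mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 3, 5]))     # 5
print(median([4, 1, 3, 2]))  # 2.5
```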
Measures of Spread
Measures of spread describe how widely dispersed or tightly clustered the data values are.
| Measure | Formula | Meaning |
|---|---|---|
| Range | Max − Min | Total spread; affected by outliers |
| IQR | Q3 − Q1 | Spread of middle 50%; resistant to outliers |
Quartiles
Quartiles divide ordered data into four equal parts:
- Q1 (Lower Quartile): Median of the lower half of data (25th percentile)
- Q2 (Median): Middle value of the entire dataset (50th percentile)
- Q3 (Upper Quartile): Median of the upper half of data (75th percentile)
Ordered data: 2, 3, 5, 7, 8, 10, 12, 14, 15, so n = 9. Median (Q2) = 8 (5th value).
Lower half: 2, 3, 5, 7. Q1 = (3+5)/2 = 4
Upper half: 10, 12, 14, 15. Q3 = (12+14)/2 = 13
IQR = Q3 − Q1 = 13 − 4 = 9
Range = 15 − 2 = 13
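As a check on the worked values, here is a minimal Python sketch of the median-of-halves method. It assumes the dataset is the nine values implied by the two halves and the median above, and it leaves the median out of both halves when n is odd, as the example does. Other quartile conventions exist and can give slightly different answers. The function names are illustrative.

```python
def median(ordered):
    """Median of an already ordered, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [2, 3, 5, 7, 8, 10, 12, 14, 15]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)             # 4.0 8 13.0
print(q3 - q1)                # IQR: 9.0
print(max(data) - min(data))  # Range: 13
```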
Five-Number Summary
Every box plot is based on five key values:
| Value | Name |
|---|---|
| Min | Smallest value (excluding outliers) |
| Q1 | Lower quartile (25th percentile) |
| Q2 | Median (50th percentile) |
| Q3 | Upper quartile (75th percentile) |
| Max | Largest value (excluding outliers) |
Identifying Outliers
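A value is an outlier if it lies more than 1.5 × IQR below Q1 or above Q3. Using Q1 = 4, Q3 = 13, and IQR = 9:
Lower fence: Q1 − 1.5 × IQR = 4 − 1.5(9) = 4 − 13.5 = −9.5
Upper fence: Q3 + 1.5 × IQR = 13 + 1.5(9) = 13 + 13.5 = 26.5
Any value below −9.5 or above 26.5 is an outlier.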
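The fence calculation can be sketched as follows, using the 1.5 × IQR rule applied in the numbers above. The helper names `outlier_fences` and `find_outliers` are hypothetical, and the dataset is the one assumed from the quartile example (Q1 = 4, Q3 = 13).

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """List the values lying strictly outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]


low, high = outlier_fences(4, 13)
print(low, high)  # -9.5 26.5

data = [2, 3, 5, 7, 8, 10, 12, 14, 15]
print(find_outliers(data, 4, 13))  # []
```

No value in this dataset falls outside the fences, so the list of outliers is empty.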
Statistical Graphs
Different types of data require different graphical representations.
Box-and-Whisker Plots
A box plot displays the five-number summary visually. The "box" spans Q1 to Q3, with a line at the median. "Whiskers" extend to the min and max; if there are outliers, they stop at the smallest and largest values that lie inside the fences.
- Draw a number line covering the data range.
- Mark Q1, median, and Q3 and draw the box.
- Draw whiskers from the box to the min and max.
- Mark any outliers as individual dots beyond the whiskers.
- A wider box means greater IQR — more spread in the middle 50%.
- If the median is closer to Q1, data is positively skewed (tail to the right).
- If the median is closer to Q3, data is negatively skewed (tail to the left).
- Use box plots to compare two datasets: compare medians, IQRs, and ranges.
Histograms
A histogram represents continuous data. Unlike bar charts, bars are adjacent (no gaps) and the area of each bar represents frequency.
- x-axis: continuous scale (not categories)
- y-axis: frequency or frequency density
- Equal class widths: height = frequency
- Unequal class widths: height = frequency density = frequency ÷ class width
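A short sketch of the frequency density rule for unequal class widths. The class intervals and frequencies below are hypothetical, invented only to show the arithmetic, and `frequency_densities` is an illustrative name.

```python
def frequency_densities(classes):
    """classes: list of (lower, upper, frequency) tuples.

    Returns (lower, upper, density), where density = frequency / class width.
    """
    return [(lo, hi, f / (hi - lo)) for lo, hi, f in classes]


# Hypothetical grouped data with unequal class widths.
classes = [(0, 10, 5), (10, 30, 12), (30, 35, 4)]

for lo, hi, density in frequency_densities(classes):
    # Bar area = density * width, which recovers the class frequency.
    print(f"{lo}-{hi}: density = {density:.2f}, area = {density * (hi - lo):.0f}")
```

Each bar's area equals its class frequency, which is why the widest class here (10 to 30) does not get the tallest bar.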
Cumulative Frequency Graphs
- Calculate running totals (cumulative frequency).
- Plot points at the upper class boundary against cumulative frequency.
- Join with a smooth S-curve (ogive).
- Read off the median (at n/2), Q1 (at n/4), Q3 (at 3n/4).
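The steps above can be sketched in Python. Two assumptions are worth flagging: the class boundaries and frequencies are illustrative (five classes of width 10 starting at 10), and the read-off joins the points with straight segments rather than a hand-drawn smooth curve, so the estimates will differ slightly from a graph reading. The function names are hypothetical.

```python
from itertools import accumulate


def cumulative_points(lowest_boundary, upper_boundaries, frequencies):
    """Points (upper class boundary, cumulative frequency), starting at zero."""
    running = list(accumulate(frequencies))
    return [(lowest_boundary, 0)] + list(zip(upper_boundaries, running))


def read_off(points, target):
    """Estimate the x-value where cumulative frequency reaches target.

    Uses straight segments between plotted points (an approximation
    to the smooth ogive).
    """
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c1 > c0 and c0 <= target <= c1:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is outside the cumulative frequency range")


# Illustrative data: classes 10-20, 20-30, 30-40, 40-50, 50-60.
points = cumulative_points(10, [20, 30, 40, 50, 60], [4, 7, 5, 3, 1])
n = points[-1][1]  # total frequency

print(points)
print("Q1 estimate:", round(read_off(points, n / 4), 1))
print("Median estimate:", round(read_off(points, n / 2), 1))
print("Q3 estimate:", round(read_off(points, 3 * n / 4), 1))
```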
Correlation & Scatter Plots
Scatter plots show the relationship between two numerical variables. Correlation describes the strength and direction of that relationship.
Types of Correlation
| Correlation | Pattern | Real-World Example |
|---|---|---|
| Strong positive | Points close to an upward line; as x ↑, y ↑ | Height vs. arm span |
| Weak positive | General upward trend but scattered | Hours of sunshine vs. ice cream sales |
| Strong negative | Points close to a downward line; as x ↑, y ↓ | Temperature vs. heating bill |
| Weak negative | General downward trend but scattered | Age of car vs. sale price |
| No correlation | No visible pattern | Shoe size vs. exam score |
Line of Best Fit
- A straight line drawn through the data that best represents the trend.
- Roughly equal numbers of points should be above and below the line.
- The line should pass through or near the mean point (x̄, ȳ).
- You can use the line to interpolate (estimate within the data range) or extrapolate (estimate beyond the data range — less reliable).
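A line drawn by eye is what the list above describes. As a point of comparison, the sketch below uses the least-squares method instead, which is a different technique that always passes exactly through the mean point. The data (hours studied and test scores) are hypothetical, and the function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # forces the line through the mean point
    return slope, intercept


# Hypothetical data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

slope, intercept = least_squares_line(hours, scores)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")

# Interpolation: 3.5 hours lies inside the data range (1 to 5).
print(round(slope * 3.5 + intercept, 1))
```

Predicting for a value such as 20 hours with the same line would be extrapolation, since it lies far outside the 1 to 5 hour range of the data.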
Worked Examples
Full solutions with reasoning and interpretation.
Median: Data is already ordered. n = 9, so the median is at position (9+1)/2 = 5. The 5th value is 22.
Discussion: The value 45 is much larger than the rest and pulls the mean upward. The median (22) is more appropriate because it is not affected by this outlier and better represents the "typical" value.
Σ(f × midpoint): 4(15) + 7(25) + 5(35) + 3(45) + 1(55) = 60 + 175 + 175 + 135 + 55 = 600
Σf: 4 + 7 + 5 + 3 + 1 = 20
Estimated mean: 600/20 = 30
Modal class: The class with the highest frequency is 20–30 (f = 7).
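The grouped-data estimate above can be reproduced with a few lines of Python. This sketch takes the midpoints and frequencies from the working shown. The variable names are illustrative.

```python
# Class midpoints and frequencies from the working above.
midpoints = [15, 25, 35, 45, 55]
frequencies = [4, 7, 5, 3, 1]

total = sum(f * m for f, m in zip(frequencies, midpoints))  # sum of f * midpoint
count = sum(frequencies)                                    # sum of f
estimated_mean = total / count

# The modal class is the one with the highest frequency.
modal_index = frequencies.index(max(frequencies))

print(total, count, estimated_mean)  # 600 20 30.0
print(midpoints[modal_index])        # 25, the midpoint of the 20-30 class
```

The result is an estimate because each value is treated as if it sat at its class midpoint.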
Lower half: 3, 5, 7, 8, 12. Q1 = 7 (3rd value)
Upper half: 14, 16, 19, 21, 24. Q3 = 19 (3rd value)
IQR: 19 − 7 = 12
Outlier test: Lower fence: 7 − 1.5(12) = −11. Upper fence: 19 + 1.5(12) = 37.
All values fall within [−11, 37], so there are no outliers.
Extrapolation (20 hours): This would be unreliable because 20 hours is far beyond the data collected. The positive correlation may not continue — there is likely a ceiling effect (scores cannot exceed 100%), and studying beyond a certain point may produce diminishing returns. Extrapolation assumes the trend continues, which we cannot verify.
Spread: Class A IQR = 72 − 55 = 17. Class B IQR = 80 − 48 = 32. Class B has a much larger IQR, meaning the middle 50% of Class B's scores are more spread out — more variation in performance.
Range: Class A: 88 − 45 = 43. Class B: 95 − 30 = 65. Class B has a wider overall range.
Conclusion: Class A performed more consistently (smaller IQR and range) with a slightly higher median. Class B had more variable performance.
Practice Q&A
Attempt each question before revealing the answer.
Mean: (3+5+6+8+8+8+9+10+11+12)/10 = 80/10 = 8
Median: n = 10 (even). Middle values: 8 and 8. Median = 8
Mode: 8, which appears 3 times (more often than any other value).
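These three answers can be checked with Python's standard `statistics` module. The dataset is the one from the mean calculation above.

```python
import statistics

data = [3, 5, 6, 8, 8, 8, 9, 10, 11, 12]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # mean of the two middle values (n is even)
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

All three measures equal 8 for this dataset, which is typical of roughly symmetrical data.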
Flashcard Review
Mean = Σx / n