"Correlation doesn't imply causation" is one of the most repeated phrases in statistics โ but understanding why it matters changes how you read news, studies, and data. Here's the clear explanation.
The correlation-causation distinction is cited constantly but explained poorly. Here's what it actually means, why it matters, and how to spot the error in the wild.
Two variables are correlated when they tend to move together: when one goes up, the other tends to go up (positive correlation) or down (negative correlation).
Causation means one variable directly causes a change in the other. Causation is much harder to establish than correlation: it requires evidence of mechanism, temporal order (cause comes before effect), and ideally experimental confirmation.
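To make "move together" concrete, here is a minimal sketch that computes the Pearson correlation coefficient, one common measure of linear correlation. The function name pearson_r and the study-hours data are invented for illustration and do not come from any real dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, made up for this example.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]   # tends to rise with hours: positive
errors_made = [9, 8, 6, 5, 2]       # tends to fall with hours: negative

r_positive = pearson_r(hours_studied, exam_score)
r_negative = pearson_r(hours_studied, errors_made)
print(round(r_positive, 2), round(r_negative, 2))
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little or none. The number says nothing about which variable, if either, is the cause.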
1. Reverse causation: The direction of influence is backwards from what's assumed.
Example: Countries with more hospitals have more deaths. Conclusion: hospitals cause death? No. Sick people go to hospitals, so the influence runs the other way: places with more serious illness, and the deaths that come with it, need and build more hospitals.
2. Confounding variable: A third variable causes both the correlated variables.
Example: Ice cream sales correlate with drowning deaths. Cause? Hot weather increases both ice cream consumption AND swimming, which increases drowning risk. Hold the confounder (temperature) constant and the correlation largely disappears.
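The ice cream example can be sketched as a small simulation. Everything here is assumed for illustration: the temperature range, the coefficients, and the noise levels are invented. Temperature drives both series and neither affects the other. To "hold temperature constant", the sketch removes the straight-line effect of temperature from each series and correlates what is left (a partial correlation).

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical daily data: temperature is the only shared driver.
temperature = [rng.uniform(5, 35) for _ in range(2000)]
ice_cream = [20 * t + rng.gauss(0, 100) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson_r(ice_cream, drownings)
partial_r = pearson_r(residuals(ice_cream, temperature),
                      residuals(drownings, temperature))
print(round(raw_r, 2), round(partial_r, 2))
```

In this setup the raw correlation is substantial while the partial correlation sits close to zero. That is what you would expect when a third variable explains the whole association.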
3. Spurious correlation: Two variables correlate by chance, especially in small datasets or when many correlations are tested.
Famous examples: The number of Nicolas Cage films released per year correlates with drowning deaths in swimming pools. Per capita cheese consumption correlates with deaths from bedsheet tangling. These are statistical noise masquerading as signal.
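A rough sketch of how chance alone produces such matches: generate one random "target" series of ten yearly values, then compare it against many unrelated random series and keep the best match. The series lengths and counts are arbitrary assumptions, and all data is pure noise by construction.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

rng = random.Random(1)  # fixed seed so the run is repeatable
n_years = 10            # short series, as in many "yearly" comparisons
n_series = 200          # number of unrelated variables we try

target = [rng.gauss(0, 1) for _ in range(n_years)]
candidates = [[rng.gauss(0, 1) for _ in range(n_years)]
              for _ in range(n_series)]

correlations = [pearson_r(target, series) for series in candidates]
best_r = max(correlations, key=abs)                   # strongest match found
strong = sum(1 for r in correlations if abs(r) > 0.6)
print(round(best_r, 2), strong)
```

With short series and enough candidates, some correlations will look impressive even though every series is independent noise. This is why a single striking correlation pulled from a large search deserves suspicion.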
The gold standard is a Randomised Controlled Trial (RCT): randomly assign participants to a treatment or control group and compare outcomes. Because assignment is random, every other variable, measured or not, is balanced between the groups on average, which removes confounding in expectation.
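The effect of random assignment can be sketched with a simulation. All quantities are assumptions made for illustration: a hidden "health" score acts as the confounder, the treatment has a true effect of 1.0, and in the observational version healthier people choose the treatment themselves.

```python
import random

rng = random.Random(2)   # fixed seed so the run is repeatable
TRUE_EFFECT = 1.0        # assumed real effect of the treatment
N = 20_000               # participants per simulated study

def mean(values):
    return sum(values) / len(values)

def difference_in_means(assign):
    """Simulate one study; assign(health) decides who is treated."""
    treated, control = [], []
    for _ in range(N):
        health = rng.gauss(0, 1)     # confounder: also improves the outcome
        is_treated = assign(health)
        outcome = 2.0 * health + TRUE_EFFECT * is_treated + rng.gauss(0, 1)
        (treated if is_treated else control).append(outcome)
    return mean(treated) - mean(control)

# Observational: healthier people opt in, so the groups differ in health.
observational = difference_in_means(lambda health: health > 0)

# Randomised: a coin flip decides, so health is balanced on average.
randomised = difference_in_means(lambda health: rng.random() < 0.5)

print(round(observational, 2), round(randomised, 2))
```

The observational comparison greatly overstates the effect because it mixes the treatment with the health difference between groups. The randomised comparison lands close to the assumed true effect.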
When RCTs are impossible (you can't randomly assign people to smoke for 30 years), epidemiologists use the Bradford Hill criteria, a checklist including: strength of association, consistency across studies, biological plausibility, dose-response relationship, and temporal order.
Correlation is valuable even without causation for prediction purposes. If shoe size predicts reading ability in children, you can use it to identify struggling readers, even though buying bigger shoes won't help them read. In medicine, correlated biomarkers can predict disease risk without being causes. In finance, correlated assets inform portfolio diversification. The key is not to intervene on the correlated variable and expect to change the outcome.
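The shoe size example can be sketched as follows, under invented assumptions: age drives both shoe size and reading score, and the formulas and noise levels are hypothetical. The sketch fits a line predicting reading from shoe size, then simulates an "intervention" that gives every child shoes two sizes bigger.

```python
import random
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

def simulate(shoe_boost, seed=3, n=1000):
    """Hypothetical children: age drives both shoe size and reading score."""
    rng = random.Random(seed)
    shoes, reading = [], []
    for _ in range(n):
        age = rng.uniform(5, 12)
        shoes.append(20 + age + rng.gauss(0, 1) + shoe_boost)
        reading.append(10 * age + rng.gauss(0, 8))   # depends on age only
    return shoes, reading

shoes, reading = simulate(shoe_boost=0)
r = pearson_r(shoes, reading)        # useful for prediction

# Least-squares slope of reading on shoe size.
ms, mr = mean(shoes), mean(reading)
slope = (sum((s - ms) * (y - mr) for s, y in zip(shoes, reading)) /
         sum((s - ms) ** 2 for s in shoes))
predicted_gain = slope * 2           # what the fitted line "promises"

# Intervention: same children (same seed), shoes two sizes bigger.
_, reading_after = simulate(shoe_boost=2)
actual_gain = mean(reading_after) - mean(reading)

print(round(r, 2), round(predicted_gain, 1), actual_gain)
```

The correlation is strong enough to flag likely struggling readers, and the fitted line suggests a large gain from bigger shoes. The simulated intervention changes nothing, because reading was built to depend only on age.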