Who Is Simpson And What Does His Paradox Mean For Ecologists?
Edward H. Simpson was a codebreaker at Bletchley Park, the home of Allied codebreakers during the Second World War. While you’d think this would be his claim to fame, perhaps his most lasting contribution is his description of Simpson’s paradox. The paradox describes the phenomenon whereby a relationship within a dataset changes dramatically, or even reverses, depending on whether you look at the data by group or all together. The best-known examples of the paradox come from the medical world or from the famous Berkeley admissions case. But what examples can we keep in mind in ecological settings to guide us?
Let’s consider the dimensions of penguins’ bills compiled from Palmer Station in Antarctica. If we are interested in the relationship between bill depth and bill length, we might do a preliminary analysis like the following linear regression.
There appears to be a negative relationship between bill depth and bill length. But what if we consider the fact that there are three different species of penguins at Palmer Station? Now we might want to let each species have its own relationship between bill depth and bill length. When we do this, we get a surprise: each penguin species now has a positive relationship between bill length and depth.
This is Simpson’s Paradox in full force. The relationship between two variables changes sign depending on whether or not we account for subgroups in the data. You can also think of this situation as arising from a confounding variable. Without accounting for species, we misrepresent the relationship between penguin bill length and bill depth.
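To make the sign flip concrete, here is a minimal sketch in plain Python. The numbers are invented for illustration (they are not the Palmer Station measurements), and the helper `ols_slope` and the labels `group_a` and `group_b` are hypothetical names. It fits one line to everything pooled together and then one line per group.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented (bill length, bill depth) values, NOT real penguin data.
# Within each group, depth rises with length; group_b simply has
# longer but shallower bills overall.
groups = {
    "group_a": ([36, 38, 40, 42], [17.0, 18.0, 19.0, 20.0]),
    "group_b": ([44, 46, 48, 50], [13.5, 14.5, 15.5, 16.5]),
}

# Pool every observation and ignore the grouping.
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
pooled_slope = ols_slope(all_x, all_y)

# Fit each group separately.
within_slopes = {name: ols_slope(xs, ys) for name, (xs, ys) in groups.items()}

print(f"pooled slope: {pooled_slope:+.3f}")
for name, s in within_slopes.items():
    print(f"{name} slope: {s:+.3f}")
```

In this toy dataset the pooled slope comes out negative while both within-group slopes are positive, which is the same qualitative pattern as in the penguin figures.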
This situation can also occur when the confounder is quantitative. Just picture a quantitative variable split up into bins, where in each bin the relationship is the opposite of what you see overall, just like in this penguin example. It may not be anything fundamental about the species difference; it could simply be that one species is bigger than another. For instance, it might seem reasonable that larger penguins would have longer bills on average. Let’s have a look at the different species’ sizes.
We see that at least the Gentoo penguin species tends to have heavier individuals. What if we control for body mass in the regression between the bill depth and length instead of breaking the analysis into three different regressions?
Heavier penguins do tend to have longer bills, but we can still notice the different species’ effects. For example, the cluster above that we now know represents the Gentoo species has noticeably larger body mass values. We can also visualize this model fit to the data, going up a dimension, from line to plane, to account for the more complicated model. The relationship between bill depth and bill length in the presence of body mass is positive, so Simpson’s Paradox has been foiled by controlling for body mass (although it seems like accounting for species helps more).
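Here is a hedged sketch of what "controlling for" a quantitative variable does to a slope. The data are invented (the depth values are generated from a made-up formula with a true bill-length effect of +0.5), and the helper names `ols_fit` and `residuals` are hypothetical. Rather than calling a regression library, the sketch uses the residualization trick (the Frisch-Waugh-Lovell result): the coefficient on bill length in a regression of depth on length and mass equals the slope you get after removing the linear effect of mass from both length and depth.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line ys ~ xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys):
    """What is left of ys after removing its linear dependence on xs."""
    intercept, slope = ols_fit(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Invented data, NOT real penguin measurements (mass in kg, bills in mm).
mass = [3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0]
length = [36, 38, 40, 41, 43, 45, 46, 48, 50]
# Made-up generating rule: depth rises with length but falls with mass.
depth = [10 + 0.5 * bl - 4 * m for bl, m in zip(length, mass)]

# Naive fit: depth ~ length, ignoring mass.
_, naive_slope = ols_fit(length, depth)

# Adjusted fit: strip the mass signal out of both variables first.
length_given_mass = residuals(mass, length)
depth_given_mass = residuals(mass, depth)
_, adjusted_slope = ols_fit(length_given_mass, depth_given_mass)

print(f"naive slope:    {naive_slope:+.3f}")
print(f"adjusted slope: {adjusted_slope:+.3f}")
```

With these toy numbers the naive slope is negative and the adjusted slope recovers the +0.5 that was built into the data. In a real analysis you would more likely fit the two-predictor model directly with your usual regression software; the residualization here is only meant to show where the adjusted coefficient comes from.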
Simpson’s Paradox can also creep up on us when analyses are performed on aggregated data (see Qian et al., 2019 for another example). This is related to the ecological fallacy that we have talked about before, where extending findings made on aggregated data to individuals is dangerous. What if I didn’t have individual penguin data but only averages for each species?
Now I know that I have no business fitting a line to these three points, but for the sake of the example, let’s press on. We see the inappropriate sign return. Even though we see from previous figures that the relationship within each species is positive, when each species is collapsed into a summary statistic, we see a negative relationship.
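The aggregation effect can be sketched the same way. Again, the values below are invented rather than taken from the penguin data, and `ols_slope`, `group_a`, `group_b`, and `group_c` are hypothetical names. Each group has a positive slope internally, but once every group is collapsed to a single (mean length, mean depth) point, the line through the three means slopes downward.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented (bill length, bill depth) values for three groups.
groups = {
    "group_a": ([36, 38, 40, 42], [17.0, 18.0, 19.0, 20.0]),
    "group_b": ([44, 46, 48, 50], [13.5, 14.5, 15.5, 16.5]),
    "group_c": ([46, 48, 50, 52], [17.0, 18.0, 19.0, 20.0]),
}

# Individual-level view: one slope per group.
within_slopes = {name: ols_slope(xs, ys) for name, (xs, ys) in groups.items()}

# Aggregated view: keep only one mean point per group.
mean_lengths = [sum(xs) / len(xs) for xs, _ in groups.values()]
mean_depths = [sum(ys) / len(ys) for _, ys in groups.values()]
slope_of_means = ols_slope(mean_lengths, mean_depths)

for name, s in within_slopes.items():
    print(f"{name} within-group slope: {s:+.3f}")
print(f"slope through the group means: {slope_of_means:+.3f}")
```

Three points are far too few to take the fitted slope seriously, so the only thing to read off here is the sign.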
How can we know to look for this paradox before it hurts us in our analysis? If there is a natural grouping, we can easily fit the model with and without accounting for groups to check that we do not see any dramatic changes. Searching for a potential confounder Z is a bit trickier. A confounder is related to both your response of interest Y and an explanatory variable X that you have in the model to try to explain your response. We can investigate what the relationship between X and Y is at many different slices of Z (again, think of binning a quantitative variable Z or looking at subgroups of a categorical variable Z) if we actually have information on Z.
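As a rough sketch of the slicing check, assuming we did record a quantitative Z: the snippet below bins an invented body-mass variable into 1 kg wide bins and compares the overall slope with the slope inside each bin. The data, the bin width, and the helper `ols_slope` are all illustrative assumptions, not a recommended default.

```python
import math
from collections import defaultdict

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented data: Z = body mass (kg), X = bill length, Y = bill depth.
mass = [3.1, 3.3, 3.5, 3.7, 4.1, 4.3, 4.5, 4.7, 5.1, 5.3, 5.5, 5.7]
length = [36, 40, 38, 42, 41, 45, 43, 47, 46, 50, 48, 52]
# Made-up generating rule: depth rises with length, falls with mass.
depth = [10 + 0.5 * bl - 4 * m for bl, m in zip(length, mass)]

# Slice Z into bins of width 1 kg (an arbitrary choice for this sketch).
bins = defaultdict(lambda: ([], []))
for m, bl, bd in zip(mass, length, depth):
    key = math.floor(m)
    bins[key][0].append(bl)
    bins[key][1].append(bd)

overall_slope = ols_slope(length, depth)
bin_slopes = {key: ols_slope(xs, ys) for key, (xs, ys) in sorted(bins.items())}

print(f"overall slope: {overall_slope:+.3f}")
for key, s in bin_slopes.items():
    print(f"mass in [{key}, {key + 1}) kg: slope {s:+.3f}")
```

If the per-bin slopes broadly agree with each other but disagree in sign with the overall slope, as they do in this toy example, that is a hint that Z deserves a place in the model. How wide to make the bins is a judgment call, and with few observations per bin the slopes will be noisy.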
If we didn’t collect information on that confounder, then we have our work cut out for us. We do know that a potential confounder is related to both Y and X. That means that without Z in the model, there is still a relationship between X and Y that is driven specifically by Z and unaccounted for in the model. Where does unaccounted-for variation go? The residuals! If we plot the residuals of the model vs. X and see a pattern (such as distinct clusters or curvature), that might hint that there is a Z on the loose. (Thanks to a blog post by Jim Frost for this tip!)
I hope that these concrete examples (penguins seem pretty memorable, right?) and some guidance about the pain points that confounding variables can lead to will help you avoid falling into any Simpson’s Paradox traps in the wild.
Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month → @sastoudt