The Ecological Fallacy: What Does It Have To Do With Us?
Image Credit: Erik Karits, Pixabay licence, Image Cropped
Every once in a while the term “ecological fallacy” gets thrown around to critique a particular study. Some Twitter discussion around this pre-print, which compares COVID-19 mortality to vegetable consumption at a country level, got me thinking about the term again. So let’s go through what it is, why it’s a problem, and why sometimes it can’t be avoided.
Ecological correlation measures correlation between two variables where each observation measures a group, not an individual. For example, we could measure the correlation between income and happiness at the individual level or we could measure the correlation between average income and average happiness at the country level.
In general, the ecological fallacy is when we think these two correlations are necessarily the same. This fallacy has many flavors though. For example, it is possible for the majority of a country’s individuals to be unhappy while the average happiness score is positive. How? Picture a right-skewed distribution of happiness scores, where a small minority of people are very happy and a much larger group of people are a little discontent. Those lucky few happy folks are boosting the average happiness score while most of the population is experiencing less cheerful conditions.
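To make the skewed-distribution picture concrete, here is a minimal sketch with made-up numbers. The scores, the -10 to 10 scale, and the variable names are all assumptions for illustration, not data from any study.

```python
from statistics import mean

# Hypothetical happiness scores on a -10..10 scale (made-up numbers):
# eight mildly discontent people and two very happy ones.
scores = [-1] * 8 + [9] * 2

# The country-level summary: the average score.
average = mean(scores)

# The individual-level picture: what share of people score below zero?
share_unhappy = sum(s < 0 for s in scores) / len(scores)

print(f"average happiness: {average}")
print(f"share of unhappy individuals: {share_unhappy:.0%}")
```

The average comes out positive even though most individuals in this toy country score below zero, which is exactly the mismatch between the group summary and the individuals it summarizes.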
Similarly, it’s possible for there to be a negative correlation between income and happiness at the group level while at the individual level there is a positive correlation. More money can lead to a higher happiness score at the individual level, but if having neighbors with high incomes (who therefore contribute to a higher country-level income) makes an individual less happy, then this can result in an overall negative correlation between income and happiness at the group level.
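Here is a rough sketch of how that neighbor effect could flip the sign, using constructed toy data rather than anything real. The rule "happiness = own income minus twice the country's average income" is a purely hypothetical assumption chosen to make the effect visible, and `pearson` is a small helper written for this example.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical countries: average incomes differ a little between countries,
# while incomes differ a lot between people within a country.
country_means = [10, 11, 12]
offsets = [-5, 0, 5]  # each person's income relative to the country average

incomes, happiness = [], []        # one entry per individual
avg_income, avg_happiness = [], [] # one entry per country

for m in country_means:
    xs = [m + d for d in offsets]
    # Assumed rule: own income raises happiness, but a richer
    # neighborhood (higher country average) lowers it twice as fast.
    hs = [x - 2 * m for x in xs]
    incomes.extend(xs)
    happiness.extend(hs)
    avg_income.append(mean(xs))
    avg_happiness.append(mean(hs))

individual_r = pearson(incomes, happiness)
group_r = pearson(avg_income, avg_happiness)

print(f"individual-level correlation: {individual_r:.2f}")
print(f"country-level correlation:    {group_r:.2f}")
```

With these numbers the individual-level correlation is positive while the country-level correlation is negative. The sign of the individual-level result depends on the within-country spread being larger than the between-country spread, which is a choice made here for illustration.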
This sign flip for the correlation might spark a memory of another phenomenon. Simpson’s paradox is a particular type of ecological fallacy. This occurs when a correlation between two variables that shows up when the data is analyzed per group goes away or changes sign when the data is analyzed all together. Instead of some neighbor effect, this happens because of the sizes and overall levels of the response for particular groups. If happiness is positively correlated with income within every age group, but older age groups have higher average incomes and lower average happiness scores, then when the data is analyzed as a whole, without accounting for age, a negative correlation can appear.
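The age-group version can be sketched the same way. Everything below is constructed toy data: the three age groups, their baseline incomes and happiness levels, and the `pearson` helper are assumptions for illustration only.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical age groups: (baseline income, baseline happiness).
# Older groups earn more on average but are less happy on average.
groups = {"young": (10, 8), "middle": (20, 5), "old": (30, 2)}
offsets = [-2, 0, 2]  # income relative to the group's baseline

all_income, all_happiness = [], []
within_r = {}

for name, (base_income, base_happy) in groups.items():
    xs = [base_income + d for d in offsets]
    # Within a group, more income means a bit more happiness.
    hs = [base_happy + 0.5 * d for d in offsets]
    within_r[name] = pearson(xs, hs)
    all_income.extend(xs)
    all_happiness.extend(hs)

# Ignore age entirely and analyze everyone together.
pooled_r = pearson(all_income, all_happiness)

for name, r in within_r.items():
    print(f"{name:>6}: r = {r:.2f}")
print(f"pooled: r = {pooled_r:.2f}")
```

Each age group shows a positive correlation on its own, but pooling the groups produces a negative one because the differences between groups swamp the trend within them.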
The ecological fallacy, in all of its many forms, is typically found referenced in the epidemiology literature, so why isn’t it called the epidemiological fallacy? What does this fallacy have to do with ecology?
The first reference to “ecological correlation” that I could find was in a paper by W. S. Robinson in 1950 (some content in the paper has not aged well), but there is no explanation for why this term was chosen. The only place I could find any explanation of the naming was in this informal source where it explains that ecology “studies the interactions between organisms and their environment. This consideration of an individual as part of something much larger is the sense in which this type of correlation is named.” If anyone has a better origin story, please let me know!
This may be a weak tie, but this doesn’t mean that ecologists are safe from falling into an ecological fallacy trap themselves. Spatial analysis where the unit of observation is a region rather than a particular coordinate can fall prey to this fallacy. Let’s take an example from disease ecology, blending the epidemiology origins with our own ecological point of view.
For example, let’s say we find a positive association between the overall incidence of chronic wasting disease (CWD) in deer at the plot level and a covariate like tree cover at the individual site level. We might then be tempted to conclude that tree cover at a particular site is also positively correlated with the chance that deer at that site have CWD. But that seems a little more far-fetched. The problem is that we’ve swept a lot of within-plot variability under the rug, and we can’t get it back without individual-level response data. For instance, low population density at some sites might cause a lack of correlation between the chances of CWD and tree cover.
Say we have four plots, each with a mean CWD incidence of 0.25. There are many different ways to arrive at that same mean incidence. All four sites within a plot could have an incidence of 0.25 (so the group mean actually is representative of every individual site). But we could also have two sites with an incidence of 0.1 and two sites with an incidence of 0.4. Now the plot mean overestimates some sites and underestimates others. Things get even more complicated if there is a confounding variable that affects individual site response. Aggregation can then induce some oddities that come from the combination of how the confounder is distributed and where the aggregation boundaries fall.
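The arithmetic above is easy to check with a quick sketch. The two site configurations are the ones described in the text; the variable names are made up for this example.

```python
from statistics import mean

# Two hypothetical plots with four sites each.
uniform_sites = [0.25, 0.25, 0.25, 0.25]  # every site matches the plot mean
mixed_sites = [0.1, 0.1, 0.4, 0.4]        # no site matches the plot mean

mean_uniform = mean(uniform_sites)
mean_mixed = mean(mixed_sites)

# How far off is the plot mean as a stand-in for each site?
# Positive means the plot mean overestimates that site.
errors_mixed = [round(mean_mixed - s, 2) for s in mixed_sites]

print(f"uniform plot mean: {mean_uniform:.2f}")
print(f"mixed plot mean:   {mean_mixed:.2f}")
print(f"plot mean minus site incidence (mixed plot): {errors_mixed}")
```

Both plots report the same mean, so the aggregated data alone cannot tell us which of the two situations we are in.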
Even though group-level data can cause more headaches than we originally anticipated, sometimes this type of data is all that is available to us. In fact, there are lots of scenarios when aggregated data is all that can be provided as part of a data privacy effort. In epidemiology this is especially true if individual patient records can’t be used. In ecology, we could think about wanting to protect the individual locations for an endangered species by grouping counts together in a broader region. Even the US Census is in the process of implementing safeguards to avoid individual data being reconstructed from group data. But that’s a whole other story…
The moral of this story isn’t to give up on using aggregated data. Instead it’s a moral of communication. We just need to be clear about what the unit of analysis is and make sure not to transfer our conclusions to inappropriate units. Going from individuals to groups is doable; going from groups to individuals requires more information (or making uncheckable assumptions, which we’re never big fans of).
Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month → @sastoudt. You can also see more of Sara’s work at Ecology for the Masses at her profile here.