To Get Great (Statistical) Power, It Takes Great Responsibility
Many questions in ecological research ask whether something has changed over time or, put more simply, whether two things are different: vegetation levels, climate variables, maybe species diversity.
Suppose we are monitoring nutrient levels in a lake to make sure they stay at levels that are habitable for the fish living there. A policy changing what local factories are allowed to dump into the river was enacted, and we want to see if there is evidence that the nutrient levels have deteriorated in the year following the change when compared to the year before.
In a traditional hypothesis testing framework, the null hypothesis means nothing fishy (ha) is going on here; there is no difference in the average nutrient level before and after the policy change. If we reject this null hypothesis, then we have evidence to support an alternative hypothesis that something is happening with the nutrient levels; they have either increased or decreased.
But what if we fail to reject the null hypothesis? Is there really nothing going on, or did we just not have enough evidence to identify what’s going on? The probability that we reject the null when the alternative is really true is called the power. If the power is low, we might be missing a real change.
“Not enough evidence” can come in a few different forms.
- The change exists but the magnitude of the true effect, or effect size, is very small, making it hard to tease apart from the null hypothesis.
- There is something else at work obscuring the change (e.g., nutrient levels would have dropped, but a rise in temperature from last year prevented that).
- We didn’t have enough data to reveal the change.
In some respects, a small, hard-to-detect effect size is not a terrible situation to be in. If a small effect isn’t practically significant, then we might not care if we miss it (maybe these fish won’t miss a 0.00001% drop in nutrient levels). If there’s something obscuring the change, we have ways to account for that, though they require a little foreknowledge of the study system.
But what about that last one? We may have some control over the sample size if we think about it before we start the study (yes, I’m saying we need a plan). If we plan ahead (ideally even before the change takes place, though this isn’t always possible), can we collect enough nutrient samples to give ourselves a good chance of detecting a change if there really is one?
We don’t have a simple way of controlling or even calculating the power for every scenario. We would need to know the distribution of our test statistic under the alternative hypothesis, which would require us to know what that alternative actually is. So what can we do? Given a significance level, a desired power level, and an effect size, we can often calculate a recommended sample size.
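As a rough illustration of that calculation, here is a minimal sketch for the simplest case: comparing the mean nutrient level in two equally sized groups of samples with a two-sided test. It uses the textbook normal approximation, so it will run slightly lower than a t-test-based calculation. The function name and the standardized effect size (difference in means divided by a common standard deviation) are assumptions made for this example, not something specific to the lake study.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate samples needed in EACH group (before and after).

    effect_size is the standardized difference in means
    (difference / common standard deviation), assumed for illustration.
    Uses the normal approximation for a two-sided, two-sample test.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# A medium standardized effect, 5% significance level, 80% power
print(sample_size_per_group(0.5))  # → 63
```

Halving the effect size roughly quadruples the recommended sample size, which is why the choice of effect size matters so much.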
We could start by thinking beyond the typical significance level of 0.01 or 0.05. This is yet another tradeoff though: increasing the significance level to, say, 0.1 can increase the power but also increases the probability of rejecting the null hypothesis when it is actually true. In that case, we might accuse the policy of impacting the nutrient levels when truly nothing fishy is going on. Is that worse than thinking everything is okay in the lake when it’s actually not? The people in charge of the data analysis will have to be the judge of that one. Another question for the folks investigating this change is what power is “enough”. Are we happy if we correctly reject the null hypothesis 75% of the time? 90% of the time?
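To see the tradeoff in numbers, here is a small sketch that holds the sample size and effect size fixed and varies only the significance level. It assumes the same simplified setup as a basic two-sample comparison with a normal approximation; the effect size of 0.5 and the 30 samples per group are made-up values for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample test (normal approximation).

    effect_size is a standardized difference in means (an assumption
    for this sketch); n_per_group is the number of samples in each group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability the test statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical values: standardized effect 0.5, 30 samples per group
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha:.2f}  power={approx_power(0.5, 30, alpha):.2f}")
```

When the effect size is zero, the same function returns the significance level itself, which is the false alarm rate we agreed to accept.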
If we knew the effect size, we wouldn’t be doing this test to begin with, but we could at least speculate about it. Perhaps we have seen how this kind of policy change has played out in other bodies of water and that can help inform what we expect to see. We could choose an effect size smaller than what we actually expect to see in order to get a conservative sample size (intuitively we need more data to identify a smaller effect). We may try a few combinations of all of these power calculation toggles and consider the resulting sample sizes in the context of a budget or time constraints (all those vials for lake water don’t come cheap).
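Trying out those toggles might look something like the sketch below, which tabulates an approximate sample size for several combinations and flags the ones that fit a budget. Everything specific here is hypothetical: the candidate effect sizes, the budget of 100 samples per year, and the normal-approximation formula for a simple two-group comparison.

```python
from math import ceil
from statistics import NormalDist

MAX_SAMPLES_PER_YEAR = 100  # hypothetical budget, not from any real study


def n_per_group(effect_size, alpha, power):
    """Approximate per-group sample size (two-sided, normal approximation)."""
    z = NormalDist()
    total_z = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (total_z / effect_size) ** 2)


def sample_size_grid(effect_sizes, alphas, powers):
    """Return {(effect_size, alpha, power): n_per_group} for every combination."""
    return {
        (d, a, p): n_per_group(d, a, p)
        for d in effect_sizes
        for a in alphas
        for p in powers
    }


# Standardized effect sizes are guesses; smaller ones are more conservative
grid = sample_size_grid((0.3, 0.5, 0.8), (0.05, 0.10), (0.75, 0.90))
for (d, a, p), n in grid.items():
    verdict = "within budget" if n <= MAX_SAMPLES_PER_YEAR else "over budget"
    print(f"effect={d}  alpha={a}  power={p}  n per year={n}  ({verdict})")
```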
In more complicated situations, these calculations may be less straightforward, and we might want to take a simulation approach. This would require us to consider some plausible data-generating processes for nutrient levels in this lake where we know the answer to the question (there is a change of a certain size), generate data from these processes many times, and perform a hypothesis test each time, all while keeping track of how often we correctly reject the null hypothesis.
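Here is a bare-bones sketch of that simulation loop. The data-generating process is deliberately simplistic and entirely assumed: independent, normally distributed measurements around a baseline of 10 units, with a known drop after the policy change. Real nutrient data would likely have seasonality and correlation over time, which is exactly where a simulation earns its keep. The test is a large-sample z-style comparison of two means, chosen only to keep the example short.

```python
import random
from statistics import NormalDist, mean, variance


def simulate_power(n_per_year, true_drop, sd, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power by simulating many before/after studies.

    Assumed data-generating process (illustration only): independent
    normal measurements, baseline mean 10.0, dropping by true_drop
    after the policy change, with the same standard deviation sd.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        before = [rng.gauss(10.0, sd) for _ in range(n_per_year)]
        after = [rng.gauss(10.0 - true_drop, sd) for _ in range(n_per_year)]
        # Large-sample test statistic for a difference in two means
        std_err = (variance(before) / n_per_year + variance(after) / n_per_year) ** 0.5
        z_stat = (mean(before) - mean(after)) / std_err
        if abs(z_stat) > z_crit:
            rejections += 1
    # Share of simulated studies in which we rejected the null
    return rejections / n_sims


# Weekly samples for a year before and after, with a hypothetical drop
print(simulate_power(n_per_year=52, true_drop=0.5, sd=1.0))
```

Setting `true_drop` to zero is a useful sanity check: the rejection rate should then land near the significance level.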
Pre-planning (and to be fair, some extra work at the beginning of a data investigation) can help us when we get to the end of the analysis. If we fail to reject the null and we have planned for a power level we’ve deemed acceptable, then we can at least say we had a good chance of detecting a change of the size we planned for. The lake may be safe for now!