What’s the Deal with P-Values and Their Friend the Confidence Interval?
After the first edition of Ecology for the Masses’ new Stats Corner, many people requested a discussion of p-values. Ask and you shall receive! And as an added bonus, we’ll also talk about confidence intervals. (Image Credit: Patrick Kavanagh, CC BY 2.0, Image Cropped)
Much of ecological research involves making a decision. Does implementing a particular management strategy significantly increase the species diversity of a region? Is the amount of tree cover significantly associated with the number of deer? Do bigger individuals of a species tend to have longer life expectancies?
To answer these questions, ecologists collect data and perform a statistical test, either explicitly or in the form of interpreting the significance of a coefficient (usually some sort of value relating to the effect of an environmental variable, like temperature or pollution levels) in a model. The p-value is often used to help translate the results of a test or model into a decision. You’ve heard it over and over again: if the p-value is less than 0.05 we reject the null in favor of the alternative. But what does that really mean? What is the null? What is the alternative? And what is so special about 0.05?
Plenty of people have weighed in on the use of p-values. This will not be a post that judges (or applauds) you for using p-values; instead the goal of this post is to make sure readers understand what p-values really are, and where they may lead us astray.
Consider the recent study assessing bird abundance over time. A null hypothesis in this scenario is that there is no change in bird abundance over time. An alternative hypothesis is that bird abundance is decreasing over time. The patterns we might see and methods used are of course quite nuanced, but here, let’s consider a simplified scenario where we have data on the estimates of bird abundance across a series of years. We could start by performing an ordinary linear regression using the abundance as the response variable and year as the explanatory variable (yes, there are lots of good reasons not to do this, but just for the sake of argument, bear with me) to try to get some information on whether bird abundance seems to be changing over time.
When we fit this model we will get a coefficient giving us an idea of the effect of “year,” an estimated standard error for that coefficient, and a p-value for the coefficient. The p-value is the probability (“p” for “probability”) that we would obtain an estimated coefficient equal to or more extreme than the one we calculated, given that the null hypothesis (there is no change in bird abundance over time) is true. Note that this is not the probability that the null hypothesis itself is true. The intuition is that if this probability is small, a result like ours would be unlikely to arise just by chance under the scenario of the null hypothesis, providing evidence in favor of the alternative hypothesis.
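To make the "equal to or more extreme" idea concrete, here is a minimal sketch using only the Python standard library. It swaps in a permutation test rather than the t-test a regression package would normally report: shuffling the abundances across years mimics a world where the null hypothesis holds, and the p-value is the share of shuffles whose slope is at least as extreme as the observed one. The data, the seeds, and the function names are all made up for illustration, and "more extreme" is taken in both directions (a two-sided test).

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, n_perm=5000, seed=1):
    """Two-sided permutation p-value for the slope (a stand-in for the usual t-test)."""
    rng = random.Random(seed)
    observed = slope(xs, ys)
    shuffled = list(ys)
    as_extreme = 0
    for _ in range(n_perm):
        # Shuffling breaks any link between year and abundance,
        # which is what the null hypothesis claims.
        rng.shuffle(shuffled)
        if abs(slope(xs, shuffled)) >= abs(observed):
            as_extreme += 1
    # The +1 counts the observed data as one of the possible arrangements.
    return observed, (as_extreme + 1) / (n_perm + 1)


# Hypothetical data: abundance drops by about 8 birds per year, plus noise.
data_rng = random.Random(42)
years = list(range(2000, 2015))
abundance = [500 - 8 * (y - 2000) + data_rng.gauss(0, 20) for y in years]

observed_slope, p_value = permutation_p_value(years, abundance)
print(f"estimated slope: {observed_slope:.2f} birds per year")
print(f"permutation p-value: {p_value:.4f}")
```

With a trend this strong relative to the noise, very few shuffles should produce a slope as steep as the observed one, so the p-value comes out small.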
Now, I brushed past the standard error of the coefficient at first, but it is closely related to the p-value. Instead of using the p-value to help make a decision, we could use the coefficient and its standard error to create a confidence interval, which we could then use to help us make a decision. The statement that is drilled into most beginners in scientific modelling goes as follows: the 95% confidence interval means that if you replicated your study 100 times and calculated a confidence interval each time, you would expect about 95 of them to cover the true value of your parameter of interest.
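The "replicate your study" statement can be checked by simulation. The sketch below is only an illustration under simplifying assumptions: the true slope, the noise level, and the number of years are invented, and the noise standard deviation is treated as known so that a normal critical value (about 1.96) can be used instead of the t-distribution a real regression would need.

```python
import random
from statistics import NormalDist


def coverage(true_slope=-8.0, noise_sd=20.0, n_years=15, n_studies=2000, seed=7):
    """Fraction of simulated studies whose 95% interval covers the true slope."""
    rng = random.Random(seed)
    xs = list(range(n_years))
    mean_x = sum(xs) / n_years
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # Assumption: the noise SD is known, so the slope's standard error is known too.
    se = noise_sd / sxx ** 0.5
    z = NormalDist().inv_cdf(0.975)  # roughly 1.96
    covered = 0
    for _ in range(n_studies):
        # One replicated "study": same design, fresh random noise.
        ys = [500 + true_slope * x + rng.gauss(0, noise_sd) for x in xs]
        mean_y = sum(ys) / n_years
        estimate = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        if estimate - z * se <= true_slope <= estimate + z * se:
            covered += 1
    return covered / n_studies


coverage_rate = coverage()
print(f"share of intervals covering the truth: {coverage_rate:.3f}")
```

The printed share should land near 0.95, though not exactly on it, because the 95% describes the long-run behaviour of the procedure rather than any fixed batch of studies.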
It is important to realize that this does not mean that we have a 95% chance of making the right decision based on our confidence interval. The truth either is or is not in our confidence interval. Our study setup is what is being evaluated, not the particular study result. To use a confidence interval to make a decision, we consider any value within the interval to be plausible (since about 95 out of 100 intervals calculated under our setup would cover the truth). If zero lies within the confidence interval, then it is plausible that there is no relationship between time and abundance and we fail to reject the null hypothesis of no change in abundance over time. As long as the test is two-sided and the confidence level matches the threshold (a 95% interval paired with a 0.05 cutoff), the decision we make with the p-value and the confidence interval will be the same.
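As a rough sketch of why the two routes agree, the function below takes a coefficient and its standard error, builds a 95% interval, computes a two-sided p-value, and reports both decisions. It uses a normal approximation for simplicity (regression software would typically use a t-distribution), and the function name and the example coefficients are hypothetical.

```python
from statistics import NormalDist


def decide(coef, se, level=0.95):
    """Return the interval, the p-value, and the reject decision from each."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    lower, upper = coef - z_crit * se, coef + z_crit * se
    # Two-sided p-value under the normal approximation.
    p_value = 2 * (1 - std_normal.cdf(abs(coef / se)))
    reject_by_ci = not (lower <= 0 <= upper)  # zero outside the interval
    reject_by_p = p_value < (1 - level)       # p-value below 0.05
    return (lower, upper), p_value, reject_by_ci, reject_by_p


# Hypothetical (coefficient, standard error) pairs for the "year" effect.
for coef, se in [(-8.0, 1.2), (-1.0, 1.2), (2.5, 1.0)]:
    (lower, upper), p_value, by_ci, by_p = decide(coef, se)
    print(f"coef={coef:5.1f}  CI=({lower:6.2f}, {upper:6.2f})  "
          f"p={p_value:.4f}  reject via CI: {by_ci}  reject via p: {by_p}")
```

Both columns of decisions match because the interval excludes zero exactly when the coefficient is more than the critical number of standard errors away from zero, which is the same condition that pushes the p-value below the cutoff.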
What if the data we had on our birds were a little richer? Instead of total abundance over time, we have species-specific abundance over time. We might fit an ordinary linear regression between abundance and year for each species, or if we are feeling fancy, use a species indicator term to obtain one model to rule them all. Now we have a different coefficient, standard error, and p-value for each species, explaining the relationship between its abundance and time. We could evaluate each p-value separately, but the more species there are, the more likely it is that we’ll get at least one p-value less than 0.05 just by chance. When that happens, we will falsely decide that a species’ abundance is changing over time.
The intuition here is that it is unlikely that one unlikely thing will happen to us, but it is more likely that one of many unlikely things will happen to us. Slightly more formally, you may have learned at some point that the probability of event A or event B happening is the sum of their individual probabilities (caveat, because I’m a statistician and can’t help it: the sum is exact only when the events cannot both happen, and otherwise it is an upper bound. If the tests are independent, which is maybe not a good assumption in the case of different bird species, but forgive me, the chance of at least one false positive among m tests at the 0.05 level is 1 - 0.95^m, which stays close to the sum while the sum is small). So if we add enough really small probabilities together, the sum will eventually get big enough to cross into “likely” territory. In statistics, we call this the “multiple testing” problem. The good news is there are ways to adjust the p-values for how many tests we do in order to still make a decision based on them, but the bad news is that a discussion of those methods is out of the scope of this post. We’ll save that for another time if there is interest.
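To put numbers on how quickly "unlikely" adds up, here is a small sketch. It assumes every null hypothesis is true and that the tests are independent (an assumption that may well fail for real bird species), in which case the chance of at least one p-value below 0.05 among m tests is 1 - 0.95^m. The simple sum of the individual probabilities is shown alongside as an upper bound. The function names are made up for this example.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """P(at least one p-value < alpha) when all nulls are true and tests are independent."""
    return 1 - (1 - alpha) ** n_tests


def sum_bound(n_tests, alpha=0.05):
    """Adding up the individual probabilities gives an upper bound, capped at 1."""
    return min(1.0, n_tests * alpha)


for n_species in (1, 5, 10, 20, 50):
    exact = chance_of_any_false_positive(n_species)
    bound = sum_bound(n_species)
    print(f"{n_species:3d} species: chance of a false alarm = {exact:.3f} "
          f"(sum of probabilities: {bound:.2f})")
```

With 20 species, the chance of at least one false alarm is already about 0.64, even though each individual test only has a 0.05 chance of misleading us.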
Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month to @sastoudt.