A common goal of ecologists is to understand the population abundance of a particular species. We might be looking for the California condor as part of assessing how well the recovery project is going. This requires some field work, going out to a variety of sites and counting animals that we see. How do we choose which sites to go to? Even in the era of camera traps, we still need to know where to put our extra set of eyes. It would be a shame to have a particular camera not get any action due to an unlucky placement. We don’t have infinite time and money after all!
If a species is fairly prevalent, a random sample of sites might let us see plenty of animals. However, we know species distributions are rarely even across a given region. More likely the species is a bit more rare in certain areas (especially for our critically endangered condor) and/or individuals tend to cluster together. A random sample could lead us to a bunch of unfruitful site visits, despite the fact that the species is quite common in other areas close to those sites.
But what if we had a little information about the uneven distribution of condors? Adaptive sampling methods allow us to incorporate information about the structure that we’ve observed so far to help us decide where to sample next.
We head out, starting with a small random sample of sites to visit. For every site where we see at least one condor, we also sample all of the site’s neighbors. We keep doing this until we fail to see a condor at any of the previous sites’ neighbors. At that point we’ve reached the end of the cluster. We refer to sites that don’t have any condors but that are in the neighborhood of a site that does as “edge units.”
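To make the procedure concrete, here is a minimal Python sketch of it on a toy grid. Everything in it is an assumption made for illustration: the 5x5 grid of counts is made up, the function names are hypothetical, a site's "neighbors" are taken to be the four adjacent cells, and the three starting sites are fixed by hand (standing in for a random draw) so the result is reproducible.

```python
def neighbors(site, n_rows, n_cols):
    """Return the up/down/left/right neighbors of a grid cell (an assumed neighborhood)."""
    r, c = site
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < n_rows and 0 <= j < n_cols]


def adaptive_cluster_sample(counts, initial_sites):
    """Visit the initial sites, then keep visiting neighbors of every occupied site.

    counts: 2D list of condor counts per site (hypothetical data).
    Returns a dict mapping each visited site to its observed count.
    """
    n_rows, n_cols = len(counts), len(counts[0])
    visited = {}
    to_visit = list(initial_sites)
    while to_visit:
        site = to_visit.pop()
        if site in visited:
            continue  # a site can be queued by several occupied neighbors
        r, c = site
        visited[site] = counts[r][c]
        if counts[r][c] > 0:
            # At least one condor seen here, so add this site's neighbors.
            to_visit.extend(neighbors(site, n_rows, n_cols))
    return visited


# Hypothetical counts: one cluster of three occupied sites and one lone occupied site.
counts = [
    [0, 0, 0, 0, 0],
    [0, 3, 2, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
initial = [(1, 1), (3, 3), (0, 4)]  # stand-in for a small random sample
sample = adaptive_cluster_sample(counts, initial)

# Edge units: visited sites with no condors that border a visited occupied site.
edge_units = [
    s for s, y in sample.items()
    if y == 0 and any(sample.get(nb, 0) > 0 for nb in neighbors(s, 5, 5))
]
print(len(sample), "sites visited,", len(edge_units), "edge units")
```

Only one of the three starting sites has condors, but it pulls in the rest of its cluster along with the empty sites around it. The lone condor in the corner is never found, because none of the starting sites landed on it.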
Now we have more information about sites where condors are actually present, but it comes at a cost. The sample mean abundance or even the mean of cluster means can be biased under this type of adaptive sampling design (since the sample is no longer completely random). What do we do with this data now?
Luckily, there are other estimators of abundance out there that account for this bias. (Want the details? Check out this review.) A simple estimator takes advantage of the fact that our sampling started with a small random sample. We could consider the sample mean of this starting sample as a simple estimator of condor abundance. But we went through all this trouble to collect additional data using a new sampling method. Can we do better than this simple approach?
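In symbols, the simple estimator looks like this. The notation is an assumption introduced here for illustration: s_0 is the starting random sample of n_0 sites, y_i is the condor count at site i, N is the total number of sites in the region, and mu is the true mean abundance per site.

```latex
\bar{y}_0 = \frac{1}{n_0} \sum_{i \in s_0} y_i,
\qquad
\mathbb{E}\left[\bar{y}_0\right] = \mu = \frac{1}{N} \sum_{i=1}^{N} y_i
```

The second equality holds when the starting sites are a simple random sample, which is why this estimator is unbiased even though it ignores every site we added adaptively.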
The Rao-Blackwell Theorem
Now it’s time to switch gears and learn about some statistics theory. I promise it’ll be (relatively) painless. There is an important theorem that tells us how to improve estimators of a particular parameter of interest. In our case, the theorem will help us find a better way to estimate abundance than taking the sample mean of our starting sample. Sign us up!
The Rao-Blackwell theorem tells us that if we have an estimator, then we can obtain a new estimator that is never worse than the original. How do we do that? We take the conditional expectation of the original estimator given a sufficient statistic T. This becomes our new, Rao-Blackwellized, estimator.
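Written out, and assuming we call the original estimator theta-hat (a symbol chosen here, not taken from the post), the recipe and the reason it helps with the spread of the estimator are:

```latex
\hat{\theta}_{RB} = \mathbb{E}\left[\hat{\theta} \mid T\right],
\qquad
\mathbb{E}\left[\hat{\theta}_{RB}\right] = \mathbb{E}\left[\hat{\theta}\right],
\qquad
\operatorname{Var}\left(\hat{\theta}\right)
  = \operatorname{Var}\left(\hat{\theta}_{RB}\right)
  + \mathbb{E}\left[\operatorname{Var}\left(\hat{\theta} \mid T\right)\right]
  \ge \operatorname{Var}\left(\hat{\theta}_{RB}\right)
```

The middle identity is the law of iterated expectations, so averaging over T leaves any bias unchanged. The last one is the law of total variance: the leftover term is never negative, so the new estimator's variance can only shrink or stay the same.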
That sounds great, but what is a sufficient statistic? Informally, a function of the data T is sufficient if, once we know T, the rest of the data can’t tell us anything more about the parameter of interest. For example, if we are trying to estimate the population mean under a standard model (say, normally distributed data with known variance), we could do so using only the sample mean as our T. We wouldn’t need any of the original sample data to make a decision about our estimate, hence the sample mean is sufficient.
What do we mean by an estimator being better? The new Rao-Blackwellized estimator will have a mean-squared error that is less than or equal to that of the original. In fact, more general versions of the theorem even let us pick our favorite loss function (as long as it is convex, which many commonly used ones are), and this is still true. Score!
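A small simulation can make the mean-squared error claim tangible. This is a toy sketch, not the adaptive sampling estimator itself: it assumes each of n sites independently has a condor present with probability p, takes "presence at the first site" as a deliberately crude estimator of p, and conditions on the total number of occupied sites, which is sufficient for p in this model. The conditional expectation of the crude estimator given that total works out to the sample proportion. The function name and parameter values are hypothetical.

```python
import random


def simulate_mse(p=0.3, n=10, n_trials=20000, seed=0):
    """Compare the MSE of a crude estimator of p with its Rao-Blackwellized version."""
    rng = random.Random(seed)
    squared_error_crude = 0.0
    squared_error_rb = 0.0
    for _ in range(n_trials):
        # Presence (1) or absence (0) of condors at each of n sites.
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        crude = x[0]     # original estimator: only the first site
        rb = sum(x) / n  # E[x[0] | sum(x)] = sum(x) / n
        squared_error_crude += (crude - p) ** 2
        squared_error_rb += (rb - p) ** 2
    return squared_error_crude / n_trials, squared_error_rb / n_trials


mse_crude, mse_rb = simulate_mse()
print(f"crude: {mse_crude:.3f}, Rao-Blackwellized: {mse_rb:.3f}")
```

In theory the crude estimator has mean-squared error p(1 - p) and the improved one has p(1 - p) / n, so with these settings the simulated values should land near 0.21 and 0.021.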
In our adaptive sampling example, the sufficient statistic is the set of unique observations, labeled with their site ID. In our data collection process we might revisit particular sites if they are neighbors of multiple condor sightings. We don’t need information about the double counting to help us estimate average abundance, hence the unique observations are sufficient.
Investigations of the benefits of adaptive sampling over random sampling show that the efficiency gains (less work for more information) depend on whether the within-network variance is large enough. Since we often expect large variability in the abundance of a species even within clusters of sites, this is good news for ecologists.
Why is this theorem so important? It basically means that if we design an estimator, even if it’s a wild guess, we always have a concrete way to improve it. This is especially helpful if finding even a starting point for an estimator is hard. Think about how complicated our new design is; we need all the help we can get.
Let’s close with some info on the people behind this theorem. Both of the theorem’s namesakes are powerhouses in the field of statistics. Calyampudi Radhakrishna Rao is an Indian-American mathematician and statistician who has won a variety of awards including the prestigious National Medal of Science (he also has a bound named after him, the Cramér-Rao bound). Read more about him here. David Blackwell was an American mathematician and statistician with his own set of accolades including being a member of the National Academy of Sciences (he was the first African American to be included). Learn more about him from the transcript of his oral history.
Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month → @sastoudt
Image Credit: beeveephoto, CC BY-SA 2.0, Image Cropped
Everything that ecologists do – from saving endangered species to projecting climate change impacts – requires ecological data. Sometimes that data can be hard to come by, like when you’re trying to figure out the range of a rare moss. At other times, that data can be smack bang in front of you, but impossible to measure. The depth of a lake for instance, or the surface area of a tree. Today, we’ll look at how to overcome that second situation, by using other, more easy-to-obtain covariates to provide an estimate of the property you’re looking for.
In our last stats post, we talked at length about everything that can influence the outcome of a statistical model. The choice of parameters. The choice of data. But one thing we avoided talking about was the choice of the approach to the model itself. And that brings us to the two big approaches in statistical modelling – Bayesian vs. Frequentist.
In ecological studies, the quality of the data we use is often a concern. For example, individual animals may be cryptic and hard to detect. Certain sites that we should really be sampling might be hard to reach, so we end up sampling more accessible, less relevant ones. Or it could even be something as simple as recording a raven when we’re really seeing a crow (check our #CrowOrNo if you have problems with that last one). Modeling approaches aim to mitigate the effect on our results of these shortcomings in the data collection.
However, even if we had perfect data, when we decide how to model that data, we have to make choices that may not match the reality of the scenario we are trying to understand. Model mis-specification is a generic term for when our model doesn’t match the processes which have generated the data we are trying to understand. It can lead to biased estimates of covariate effects and incorrect uncertainty quantification.
After the first edition of Ecology for the Masses’ new Stats Corner, many people requested a discussion of p-values. Ask and you shall receive! And as an added bonus, we’ll also talk about confidence intervals. (Image Credit: Patrick Kavanagh, CC BY 2.0, Image Cropped)
Much of ecological research involves making a decision. Does implementing a particular management strategy significantly increase the species diversity of a region? Is the amount of tree cover significantly associated with the number of deer? Do bigger individuals of a species tend to have longer life expectancies?
When animals like these wolves travel in packs, spotting one individual means we’re more likely to spot another soon after. So how do we come up with a reliable population estimate in situations like these? (Image Credit: Eric Kilby, CC BY-SA 2.0, Image Cropped)