## Have I Seen You Before? Counting A Species With Mark-Recapture

### When we hear news of a species resurgence or decline, it’s often accompanied by a number. Think “water vole populations have doubled in the last two decades” or “there are now only 1,400 komodo dragons left in the wild”. But how do scientists come up with those numbers? Surely they can’t have counted every single individual?

Instead, researchers have to come up with rules for using data to calculate estimates. There are some tried and true approaches for coming up with reasonable rules, or formally, estimators, but how do we make a rule for something as deceptively simple as counting?

Suppose we are interested in knowing how many deer are in a particular study site. Ecologists formally call this number “abundance,” and it seems like a natural quantity to care about. This number may help us determine whether the population is on the rise or declining, perhaps in connection to a particular disease or hunting regulations.

Our first thought might be to just go out and try to count deer as we walk through the larger site, perhaps stopping at sub-sites to record observations. Our estimator N1 is the rule “take the count of deer we see as our estimate.” Is this a good estimate of the true abundance? Well, it depends! What are all the things that can go wrong? Maybe we think it’s fairly easy to see a deer, but what if we were trying to count something small, like a butterfly? What if the deer are skittish around people? What if it’s pouring down rain when we are doing the sampling? What if we aren’t good at telling the difference between species of deer? These situations might cause us to miss some of the individuals that we cross paths with, so our count will necessarily underestimate the true abundance.

If a lot of our problems in estimating abundance seem to be coming from our ability to detect individuals, can we understand or even quantify that ability? If we knew how likely we were to miss individual animals in our count, we could actually correct our sighting number and be in the clear. For example, if I miss about 25% of animals and see 100, then the 100 I saw are only about 75% of what is out there, so my estimate of abundance might be 100/0.75 ≈ 133, meaning I missed about 33 individuals. The problem here is obvious: how do we know how many individuals we’re not seeing?
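To make the correction idea concrete, here is a minimal Python sketch. It assumes the miss rate is known exactly, which is precisely what we don't have in the field, and the function name `corrected_abundance` is made up for this illustration. The key step is dividing the count by the detection probability (one minus the miss rate).

```python
def corrected_abundance(count_seen, miss_rate):
    """Scale an observed count up using an assumed, known miss rate."""
    if not 0 <= miss_rate < 1:
        raise ValueError("miss_rate must be in [0, 1)")
    detection_prob = 1 - miss_rate      # chance that any one animal is seen
    return count_seen / detection_prob  # the count is only this fraction of the total


# Hypothetical survey: 150 animals seen while missing a quarter of them.
print(corrected_abundance(150, 0.25))  # prints 200.0
```

Note that the correction divides by the detection probability rather than adding the miss rate times the count: the animals we saw are a fraction of the total, not the other way around.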

What if we just went out the next day and did a recount? Let’s call this estimate N2. Maybe this number is a little smaller than before, maybe it’s a little larger. Now this does give us extra information, but either way we still don’t know if the deer we are seeing today are the same or different from the deer we saw yesterday. Likely, N2 is a combination of old and new deer. We’ll need some way of telling the difference between the two.

This whole field experiment motivates a commonly used technique for ecologists: capture-recapture, aka mark-recapture (it’s a little friendlier). What if we tagged animals we saw on day 1 (mark) so that we could tell that we saw them before if we saw them on day 2 (re-capture)? Yes, this is going to be a more intensive process than just spotting deer, but by structuring our data collection in this way, we are adding a layer of information that we can hopefully find a way to leverage. Remember, we want to estimate our probability of detecting individuals to help us quantify how many individuals we might have missed in our initial counts.

If we were to do this capture-recapture collection, then we could partition N2 into the number of deer we saw on both day 1 and day 2 (let’s call this number Nold) and the number of deer we only saw on day 2 (let’s call this number Nnew).
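As a small illustration of that partition, suppose each day 2 sighting is recorded as either the tag we read off the animal or `None` if it carried no tag. The tag IDs and the list below are invented for this sketch, not real data.

```python
# Hypothetical day 2 field notes: a tag ID if the deer was marked on day 1,
# or None if the deer had no tag (so we had not seen it before).
day2_sightings = ["A02", None, "A05", None, None, "A06", None]

n_old = sum(1 for tag in day2_sightings if tag is not None)  # seen on both days
n_new = sum(1 for tag in day2_sightings if tag is None)      # seen only on day 2
n2 = len(day2_sightings)                                     # all day 2 sightings

print(n_old, n_new, n2)  # prints 3 4 7
```

The two pieces always add back up to N2, which is what makes this a partition.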

If Nold is a small proportion of N2, do we expect our estimate of the abundance to be closer to or farther from the true abundance N than if it were a large proportion of N2? Intuitively, if it is easy to “recapture” individuals, then we are probably doing a good job of “capturing” or sighting individuals in the first place, and so our estimate of abundance won’t be super low. Can we formalize that intuition into some math? We’re up for the challenge:

Nold/N2 = N1/N

It might seem reasonable to think that the ratio of “old” individuals to today’s sightings should be roughly equal to yesterday’s sightings out of the total available individuals if all the individuals have an equal probability of being sighted in the first place. Both sides of this equality also serve as estimates of a detection probability for a given individual, which as we recall, was something that if we knew exactly would help us estimate abundance even if we didn’t see every individual. This seems promising. And behold, an estimator is born. Rearranging this equality we get an estimate for abundance: (N1 * N2)/ Nold. This is the insight that Frederick Charles Lincoln and C.G. Johannes Petersen independently had, so you might hear this estimator called the Lincoln–Petersen estimator.
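Here is a minimal sketch of that estimator in Python. The function name `lincoln_petersen` and the survey numbers are assumptions made for illustration; the formula itself is the rearranged equality from above.

```python
def lincoln_petersen(n1, n2, n_old):
    """Estimate abundance as (N1 * N2) / Nold.

    n1    -- animals seen (and tagged) on day 1
    n2    -- animals seen on day 2
    n_old -- day 2 animals that already carried a tag
    """
    if n_old <= 0:
        raise ValueError("need at least one recapture to form the estimate")
    if n_old > min(n1, n2):
        raise ValueError("recaptures cannot exceed either day's count")
    return n1 * n2 / n_old


# Hypothetical survey: 40 deer tagged on day 1, 50 seen on day 2, 10 of them tagged.
print(lincoln_petersen(40, 50, 10))  # prints 200.0
```

In this made-up survey both sides of the equality give the same detection probability: 10/50 = 0.2 of day 2 sightings were "old", and 40/200 = 0.2 of the estimated population was seen on day 1. The guard against zero recaptures matters in practice, since the estimate is undefined if no tagged animal is seen again.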

Like with any statistical model, the devil is in the details. What are we sweeping under the rug here? First, we are assuming there are no “false positives”, i.e. we don’t think we spot an individual of this species when it actually isn’t there. We are also assuming there is the same number of deer to observe in the study area on day 1 as day 2. This is formally called the “closure” assumption; the study region is “closed” to births, deaths, and movement into or out of the study area.

There are logistical assumptions as well. Often we’ll assume individuals don’t lose their tags. And of course this whole approach is premised on each individual having the same probability of detection. These assumptions are likely not met in practice, with some individuals likely being more shy than others. One such example is moose, where the boldness of males has in the past led to more males being sighted, which in turn led to both overestimation of male populations and underestimation of female populations.

On the positive side, assumptions being broken provide great opportunities for research. One can investigate questions of the form “what happens to the quality of my estimate if *fill-in-the-blank* happens?”. We may even be able to reason about the sign of the bias which can help us make decisions about the quality of our current estimate in the meantime.
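One way to explore a "what happens if" question is a quick simulation. The sketch below is only an illustration under made-up settings: a true population of 500, two survey days, and either an equal detection probability of 0.4 for everyone or a half-and-half mix of "bold" (0.6) and "shy" (0.2) individuals. It uses the (N1 * N2) / Nold estimator from earlier and averages it over many simulated surveys.

```python
import random


def simulate_estimate(detect_probs, rng):
    """Simulate one two-day survey and return (N1 * N2) / Nold, or None if Nold is 0."""
    day1 = [rng.random() < p for p in detect_probs]  # who is seen (and tagged) on day 1
    day2 = [rng.random() < p for p in detect_probs]  # who is seen on day 2
    n1 = sum(day1)
    n2 = sum(day2)
    n_old = sum(a and b for a, b in zip(day1, day2))  # seen on both days
    if n_old == 0:
        return None
    return n1 * n2 / n_old


def mean_estimate(detect_probs, n_surveys=200, seed=1):
    """Average the estimate over repeated simulated surveys."""
    rng = random.Random(seed)
    estimates = [simulate_estimate(detect_probs, rng) for _ in range(n_surveys)]
    estimates = [e for e in estimates if e is not None]
    return sum(estimates) / len(estimates)


TRUE_N = 500  # assumed true abundance
equal = [0.4] * TRUE_N                                    # everyone equally detectable
unequal = [0.6] * (TRUE_N // 2) + [0.2] * (TRUE_N // 2)   # bold and shy individuals

print("equal detection:  ", round(mean_estimate(equal)))
print("unequal detection:", round(mean_estimate(unequal)))
```

Under these settings the equal-detection average should land close to 500, while the unequal case should come out noticeably lower, around 400. The bold animals are over-represented among both the tagged and the recaptured, which inflates Nold and pulls the estimate down. So here we can reason about the sign of the bias: unequal detection leads to underestimation.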

Now what if we aren’t ecologists? Can we still learn from this approach? This capture-recapture style of approach is actually used in many different fields beyond ecology. Think of all the times we want to estimate the abundance of something: number of people with a disease, number of bugs or errors in a piece of code or writing, number of people period (like in a national census). And in general, reasoning about what information would be helpful en route to our ultimate goal can help us design ways to estimate quantities of interest from data, and speculating about all of the ways things can go wrong in the data collection process can inform decisions made based on our results.

Note: This blog post is partially inspired by and draws from an extended example and lab activity my Statistical Inference Theory class is working through this semester. Check it out!

Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month →  @sastoudt. You can read more of Sara’s work at The Stats Corner.

Title Image Credit: Petr Kratchovil, CC0 1.0, Image Cropped