Getting to Know Your Long-Term Monitoring Data

Posted on September 15, 2022 by Sara Stoudt

Title Image Credit: Tony Webster, CC BY-SA 2.0, Image Cropped

Nature is complicated and the environment is vast. How can we possibly learn all there is to know about our surroundings? Aspects of our natural world like population dynamics and life histories influence the very survival of species, but understanding these requires data from long time periods. Luckily, technology and the remarkable commitment of some scientists have meant that we are making progress in data collection by establishing many long-term monitoring networks that collect a variety of information at many different locations. Weather sensors keep track of temperature, precipitation, and wind speed. Air- and water-quality sensors keep track of what is in the air we breathe and the water we drink.

Yet even with all this groundbreaking data, our task is far from over. With many variables, collected at many sites, over long periods of time, we are bound to lose or miss some data along the way. A battery dies or a sensor gets knocked over. These little blips can really add up over time, leaving us with a missing data mystery. How can we start to understand what is missing and strategize about how that missingness may affect the questions we want to answer?

Let’s start with a best-case scenario: you have a nice spreadsheet where the rows represent sensors at different sites, a subset of the columns represent the measurements collected at different time points, and a subset of the columns represent information about the sensor or site itself. In this format, any missing values will be explicit; a blank space or NA will mark a place where we do not have the information.

Maybe there are just a few blips here and there but otherwise we have a fairly complete picture of what is going on. Great!

Sketch of a spreadsheet where each of N rows is a sensor and T columns represent different time points. There are also columns representing latitude, longitude, elevation, sensor type, and battery type. A few random cells are grayed out to represent missing data.

But what if we aren’t so lucky? What might have happened if we saw something like this?

Alert! A sensor malfunction! A battery died or an animal knocked down Sensor 1 after time point 4. That’s good to know. We should probably go check on that in the field. Now what if we see something like this?
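If you'd rather not rely on eyeballing the spreadsheet, a check like this can be automated. Here is a minimal sketch in Python, assuming the wide table has been read into a dictionary with one list of readings per sensor and `None` marking a missing value. The sensor names, the readings, and the `min_trailing_gap` threshold are all made up for illustration.

```python
# Hypothetical wide-format readings: one list per sensor, None marks a missing value.
readings = {
    "sensor_1": [3.1, 2.9, 3.4, 3.0, None, None, None],
    "sensor_2": [2.8, None, 3.3, 3.1, 2.7, 3.0, 2.9],
    "sensor_3": [3.0, 3.2, 3.1, 2.9, 3.3, 3.1, 3.0],
}


def last_observed_time_point(values):
    """Return the 1-based position of the last non-missing value, or 0 if there is none."""
    for position in range(len(values), 0, -1):
        if values[position - 1] is not None:
            return position
    return 0


def find_dropouts(readings, min_trailing_gap=2):
    """Map each sensor whose record ends in a run of missing values to its last observed time point."""
    dropouts = {}
    for sensor, values in readings.items():
        last_seen = last_observed_time_point(values)
        trailing_gap = len(values) - last_seen
        if trailing_gap >= min_trailing_gap:
            dropouts[sensor] = last_seen
    return dropouts


dropouts = find_dropouts(readings)
print(dropouts)  # {'sensor_1': 4}
```

Sensor 2 has a single blip in the middle of its record, so it is not flagged; only a sensor that goes quiet and stays quiet shows up. How long a trailing gap should be before it counts as a dropout is a judgment call that depends on your sampling frequency.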

Maybe a central power source flickered at Time Point 2. And there could be other patterns in the missingness that we can hunt down. For example, maybe there is a correlation between the sensor type and when the missingness occurs (as shown below). One sensor type seems to be flickering on and off every other day.
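Both of these patterns can be summarized with a couple of simple counts. The sketch below is one possible way to do it in Python, assuming each row holds a sensor ID, a sensor type, and a list of readings with `None` for missing values. The rows, the type labels "A" and "B", and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (sensor ID, sensor type, readings at time points 1 through 6).
rows = [
    ("sensor_1", "A", [3.1, None, 3.4, 3.0, 2.9, 3.2]),
    ("sensor_2", "A", [2.8, None, 3.3, 3.1, 2.7, 3.0]),
    ("sensor_3", "B", [3.0, None, 3.1, None, 3.3, None]),
    ("sensor_4", "B", [2.9, None, 3.2, None, 3.0, None]),
]


def missing_count_by_time_point(rows):
    """Count how many sensors are missing a value at each time point."""
    counts = [0] * len(rows[0][2])
    for _, _, values in rows:
        for position, value in enumerate(values):
            if value is None:
                counts[position] += 1
    return counts


def missing_fraction_by_type(rows):
    """Compute the share of missing readings within each sensor type."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for _, sensor_type, values in rows:
        total[sensor_type] += len(values)
        missing[sensor_type] += sum(value is None for value in values)
    return {sensor_type: missing[sensor_type] / total[sensor_type] for sensor_type in total}


counts = missing_count_by_time_point(rows)
fractions = missing_fraction_by_type(rows)
print(counts)  # [0, 4, 0, 2, 0, 2]
print(fractions)
```

A time point where every sensor is missing (the second one here) hints at a shared cause, while a gap between the per-type fractions hints that one kind of sensor is less reliable. Neither summary proves a cause, but both tell you where to look.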

Sometimes though, we have to do a little more work to understand what we don’t have. What if our original data is structured like this?

A tall spreadsheet with three columns: sensor ID, time point, and measurement. There are three measurements for Sensor 1 and Sensor 2 but each is missing a measurement for a different time point between 1 and 4.

Now we only see what we do have and we need to infer what is missing. One way to do this is to make all possible pairs of sensor ID and time point and see which pairs are not represented in our data. Here we are missing the third time point for Sensor 1 and the first time point for Sensor 2. We could explicitly add those missing rows to our data so that we can start to investigate patterns in the missingness like we did above.
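The "all possible pairs" idea can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive recipe: the measurement values are invented, and `complete_records` is a hypothetical helper name. The layout mirrors the tall spreadsheet described above, with Sensor 1 missing time point 3 and Sensor 2 missing time point 1.

```python
from itertools import product

# Hypothetical tall-format records: (sensor ID, time point, measurement).
records = [
    (1, 1, 3.1), (1, 2, 2.9), (1, 4, 3.0),
    (2, 2, 2.8), (2, 3, 3.3), (2, 4, 3.1),
]


def complete_records(records, sensors, time_points):
    """Return one row per (sensor, time point) pair, with None where nothing was recorded."""
    observed = {(sensor, time_point): value for sensor, time_point, value in records}
    return [
        (sensor, time_point, observed.get((sensor, time_point)))
        for sensor, time_point in product(sensors, time_points)
    ]


sensors = sorted({sensor for sensor, _, _ in records})
time_points = range(1, 5)  # stated explicitly rather than inferred from the data

completed = complete_records(records, sensors, time_points)
missing_pairs = [(sensor, time_point) for sensor, time_point, value in completed if value is None]
print(missing_pairs)  # [(1, 3), (2, 1)]
```

The time points are spelled out instead of being collected from the records on purpose. If every sensor missed the same time point, that time point would never appear in the data, and pairs built only from observed values would silently skip it.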

Now what if we don’t get to start with one data file but instead we have a bunch of spreadsheets coming in, maybe one file per sensor? Maybe the dates where records are made aren’t even consistent across sensors. The horror!

Now the idea of missingness is a little more ambiguous. If Sensor 1 consistently collects data on Mondays, Wednesdays, and Fridays and Sensor 2 consistently collects data on Tuesdays, Thursdays, and Saturdays, is Tuesday’s record really “missing” from Sensor 1? We’ll have to make some decisions here. Similarly, if certain information is collected at some sites (for example, temperature or wind speed) and not others, that’s a kind of structural missingness that isn’t “wrong” per se, but is inconsistent across a broader dataset.
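One possible decision is to judge each sensor against its own schedule instead of a shared calendar. The sketch below does that in Python under a strong assumption: any weekday on which a sensor has ever recorded data is treated as part of its regular schedule. The dates and the function name `missing_on_own_schedule` are made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical record dates per sensor (September 5, 2022 was a Monday).
observed_dates = {
    # Mondays, Wednesdays, and Fridays, with Wednesday September 14 absent
    "sensor_1": [date(2022, 9, day) for day in (5, 7, 9, 12, 16)],
    # Tuesdays, Thursdays, and Saturdays, with nothing absent
    "sensor_2": [date(2022, 9, day) for day in (6, 8, 10, 13, 15, 17)],
}


def missing_on_own_schedule(dates, start, end):
    """List dates in [start, end] that fall on a sensor's usual weekdays but have no record."""
    usual_weekdays = {day.weekday() for day in dates}  # assumption: every weekday seen is scheduled
    seen = set(dates)
    missing = []
    day = start
    while day <= end:
        if day.weekday() in usual_weekdays and day not in seen:
            missing.append(day)
        day += timedelta(days=1)
    return missing


start, end = date(2022, 9, 5), date(2022, 9, 17)
gaps = {
    sensor: missing_on_own_schedule(dates, start, end)
    for sensor, dates in observed_dates.items()
}
print(gaps)
```

Under this rule, Sensor 1 is only missing Wednesday, September 14, and its lack of Tuesday records is not counted against it. A single off-schedule reading would widen a sensor's inferred schedule, though, so with messier data you might require a weekday to show up several times before treating it as expected.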

If you don’t really need a consistent sampling scheme to do whatever you want to do with this data, then you just need to keep in mind that patterns in availability of data per sensor may differ. If you do need consistency, you’ll have to get a bit fancier. You might want to think about filling in (or “imputing”) some of this missing data. That’s a whole other blog post, so we’ll stop there for now.

Thanks to @marney_pratt and @Gobblers_n_Fins for the suggestions that inspired this post, and if you are interested in thinking more about missing data, check out Nicholas Tierney and Allison Horst’s online book “The Missing Book”.

Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month → @sastoudt