Not All Datasets Are Created Equal

Image Credit: Chinmaysk, CC BY-SA 3.0, Image Cropped

Species data for understanding biodiversity dynamics: The what, where and when of species occurrence data collection (2021) Petersen et al., Ecological Solutions and Evidence

The Crux

With the rise of the internet, GPS, and smartphones, the amount of openly available species occurrence data has reached previously unfathomable numbers. This increase is mostly due to the engagement of citizen scientists – regular people getting out into nature and taking part in data collection and research. From people taking photos of flowers in their backyard to organised salamander spotting safaris, citizen scientists have opened up data that previously would have cost massive amounts to produce.

The Global Biodiversity Information Facility (GBIF) is the largest hub of such data, collating data ranging from amateur observations to museum specimens to professional surveys. It is well known, however, that this kind of openly available data comes with a myriad of caveats: some species groups are reported much more than others (I am looking at you, bird-watchers), and “roadside bias” (see Did You Know?) haunts the records. But how are the records distributed among different land-cover types at a country scale? Does the distribution differ between groups of conservation concern, and does it depend on who the reporters are?

Today’s authors (myself and colleagues) looked at whether records were biased towards particular land-cover types, whether these biases differed between vulnerable (red-listed) and non-native species, and whether they depended on the reporters being mainly “professionals” or citizen scientists. This paper was featured in a cross-journal Special Feature on Citizen Science.

What They Did

We gathered the ten largest datasets from Norway available through GBIF, using occurrence records from 2004 to 2018 (approximately 5.6 million records). We then overlaid these on a national land-cover map and determined which type of habitat each record came from. To test whether this matched what would be expected if all these species observations had been distributed completely at random, we used computer simulations to randomly scatter the 5.6 million points across the map, looked at where they fell, repeated this simulation 100 times*, and then compared the actual data to this hypothetical range of random distributions.

*A word of warning: doing this, even on a high-power remote server, took approximately two weeks. Do not do this at home, unless you are planning to kill your laptop.
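For the curious, the logic of that randomisation test can be sketched on a toy scale. This is not the code used in the paper: the land-cover "map" here is a made-up 10 x 10 grid, the observed points are invented, and every function name (`count_by_cover`, `simulate_null`, `compare_to_null`) is hypothetical. It only illustrates the idea of comparing real counts per land-cover type to the range produced by randomly scattered points.

```python
import random
from collections import Counter


def count_by_cover(points, land_cover):
    """Count how many (row, col) points fall in each land-cover class."""
    return Counter(land_cover[r][c] for r, c in points)


def simulate_null(land_cover, n_points, n_sims, seed=0):
    """Scatter n_points uniformly at random over the grid, n_sims times.

    Returns, for each land-cover class, the list of counts per simulation.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(land_cover), len(land_cover[0])
    classes = sorted({cell for row in land_cover for cell in row})
    sims = {cls: [] for cls in classes}
    for _ in range(n_sims):
        points = [
            (rng.randrange(n_rows), rng.randrange(n_cols))
            for _ in range(n_points)
        ]
        counts = count_by_cover(points, land_cover)
        for cls in classes:
            sims[cls].append(counts.get(cls, 0))
    return sims


def compare_to_null(observed_points, land_cover, n_sims=100, seed=0):
    """Compare observed counts to the min-max range of the random scatters."""
    observed = count_by_cover(observed_points, land_cover)
    sims = simulate_null(land_cover, len(observed_points), n_sims, seed)
    result = {}
    for cls, counts in sims.items():
        low, high = min(counts), max(counts)
        obs = observed.get(cls, 0)
        if obs > high:
            verdict = "over-sampled"
        elif obs < low:
            verdict = "under-sampled"
        else:
            verdict = "within random range"
        result[cls] = (obs, low, high, verdict)
    return result


# Toy map (an assumption): the top 2 rows are "urban", the other 8 "forest".
land_cover = [["urban"] * 10 for _ in range(2)] + [["forest"] * 10 for _ in range(8)]

# Invented observations: 150 records in urban cells, 50 in forest cells.
observed_points = [(i % 2, i % 10) for i in range(150)]
observed_points += [(2 + i % 8, i % 10) for i in range(50)]

result = compare_to_null(observed_points, land_cover, n_sims=100)
for cls, (obs, low, high, verdict) in result.items():
    print(f"{cls}: observed={obs}, random range={low}-{high} -> {verdict}")
```

Urban cells cover only a fifth of this toy map but hold three quarters of the invented records, so the observed urban count lands far above anything the random scatters produce. With 5.6 million points and a real national map, the same loop is what takes weeks.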

Did You Know?

“Roadside bias” is a term created to describe the fact that if you were to determine anything about biodiversity solely based on the number of species observed anywhere, distance to cities and roads would seem to be a good predictor: the closer you are to a road or a city, the more species have typically been observed. However, this does not reflect the truth – it only reflects that many more observations have been reported from such areas. In short, the number of observed species might be higher, but the number of species per record is not – people are simply more likely to report something close to where they live, have a cabin, or go on a road trip, than they are to report it from very secluded areas.
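A quick bit of arithmetic makes the point. The numbers below are invented purely for illustration, and the helper `richness_summary` is hypothetical; dividing species by records is also a crude stand-in for proper effort corrections, but it shows how raw species counts and species per record can point in opposite directions.

```python
def richness_summary(records):
    """Return (number of species, number of records, species per record)."""
    species = set(records)
    return len(species), len(records), len(species) / len(records)


# Invented data: a roadside site with many records, a remote site with few.
roadside = [f"sp{i % 50}" for i in range(1000)]  # 1000 records, 50 species
remote = [f"sp{i % 20}" for i in range(40)]      # 40 records, 20 species

roadside_summary = richness_summary(roadside)
remote_summary = richness_summary(remote)

print("roadside:", roadside_summary)
print("remote:  ", remote_summary)
```

The roadside site "wins" on raw species count (50 against 20), yet it yields far fewer species per record (0.05 against 0.5), because it has simply been visited much more often.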

What They Found

The data found in GBIF is not randomly distributed – Citizen Scientists report red-listed species much more frequently than we would expect; it seems to be more prestigious to report a rare, threatened species than an invasive one. The number of records reported from different habitat types is not randomly distributed either, and these biases are even more pronounced when we look at red-listed and non-native species. Human-affected land-cover types (such as cities and agricultural land) are heavily oversampled, whereas remote or inaccessible areas are generally under-sampled. Likewise, whether a dataset consisted mainly of Citizen Science observations or “professional” records (such as museum datasets) affected this bias as well, with Citizen Scientists operating particularly within cities and similar areas.

The huge abundance of birdwatchers means that birds are often heavily over-represented (Image Credit: Walton LaVonda, USFWS, CC0 1.0)


The waters are muddied when you try to differentiate between “professional” and “amateur” species occurrence records – some amateurs are highly knowledgeable and far more skilled observers than many professional scientists. Likewise, datasets published by museums and the like frequently use Citizen Science programs, or store specimens donated by private collectors. Dividing datasets so sharply, without looking into the identities of the individual recorders, makes the distinction between “professional” and “amateur” data very blurry.

So What?

As the proportions of Citizen Science records and “professional” records have shifted tremendously over time, it is especially important to account for these biases when using GBIF data in research: if the data is used indiscriminately, any discovered patterns could merely be a shift in bias, rather than an actual effect in the real world.

Essentially, this means that a lot of work has to go into getting familiar with the data you have downloaded from GBIF, figuring out what the biases might be, and working out how to account for them. Developing (statistical) methods for dealing with these biases is an under-explored field, and one in need of further development.

Tanja Petersen is a PhD candidate at the Norwegian University of Science and Technology. She studies the effects of urbanisation, land-use and land-use changes on biodiversity, focussing on threatened and alien species. She uses records from the Global Biodiversity Information Facility (GBIF) and official land-cover maps to track the patterns and changes over time and in space. Check out her previous articles at her Ecology for the Masses profile here or follow her on Twitter @NeanderTanja.
