Tag Archives: data

The Modern Biologist’s Challenge: Data Management

Modern biologists often do most of their most integral work not deep in a forest, but sitting behind a laptop while fuelling their caffeine addictions (Image Credit: gdsteam, CC BY 2.0, Image Cropped)

If you are asked to picture a biologist, chances are you will picture someone like Jane Goodall or David Attenborough: a determined scientist wearing zip-off pants and a pair of sturdy boots, making their way through the thick vegetation of a remote Pacific island to study the intricate social behaviour of an elusive ground-dwelling mammal. Yet these days a large portion of modern biologists embark on very different journeys. Equipped with a computer full of code and mathematical models, they venture through a jungle of spreadsheets and tables filled with row upon row of data.

First of all, some nuance is needed. I might fit the picture of the biologist who only leaves their office to refill their coffee mug or cool down after another computer meltdown, but the majority of biologists do fit the above description of the ‘traditional biologist’ to varying degrees. They might spend time out in the field, growing plants in greenhouses or cultivating microorganisms in the lab. But nowadays they’re almost all spending some time wrangling, analyzing and visualizing data behind their computers. And as this type of scientist has slowly become the norm, the amount of biological data floating around has grown exponentially. And this comes with a whole new set of challenges.

The Challenges of Data Management

Good data management is fundamental to producing high-quality research. It starts with the creation and collection of data. Even if the process involves clear protocols, calibrated measuring devices and well-trained volunteers, students or researchers, the many people who are often involved in data collection will introduce errors and biases. Identifying sources of potential error and bias and documenting these explicitly will make it possible to account for them at a later stage, yet often it’s hard to do this.

After collection, data are digitized and converted into a format suitable for subsequent analyses. During this process, a researcher, often with a particular study or research project in mind, makes any number of small, seemingly insignificant decisions that determine how the data are structured. The number of files to store the data in, variable names and data types might be logical to the researcher who processed the data, but might not appear so obvious to their student. Metadata or similar files and quality checks are often missing, so it is difficult to figure out how to interpret the content of the data. Choosing a consistent, intuitive format that is also usable in future work is not easy. As biologists are rarely trained in data management, the typical dataset may be a database manager’s worst nightmare: unorganized, inaccurate and inefficient.
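To make the idea of metadata and quality checks a little more concrete, here is a minimal sketch in Python of a data dictionary paired with a basic validation step. Everything in it is hypothetical: the column names (`ring_number`, `clutch_size`, `lay_date`), the plausible ranges and the `check_rows` helper are invented for illustration and do not come from any real dataset or standard.

```python
import csv
import io
from datetime import date

# Hypothetical data dictionary: one entry per column, recording what the
# column means, its unit, and how to parse and validate its values.
DATA_DICTIONARY = {
    "ring_number": {
        "description": "Unique ring code of the breeding female",
        "unit": None,
        "parse": str,
        "valid": lambda v: len(v) > 0,
    },
    "clutch_size": {
        "description": "Number of eggs in the completed clutch",
        "unit": "eggs",
        "parse": int,
        "valid": lambda v: 0 <= v <= 20,  # assumed plausible range
    },
    "lay_date": {
        "description": "Date the first egg was laid (ISO 8601)",
        "unit": None,
        "parse": date.fromisoformat,
        "valid": lambda v: v <= date.today(),  # no dates in the future
    },
}


def check_rows(csv_text):
    """Return (row_number, column, problem) for every failed check."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row number 0 is used for problems with the file as a whole.
    missing = set(DATA_DICTIONARY) - set(reader.fieldnames or [])
    for column in sorted(missing):
        problems.append((0, column, "column missing from file"))
    for row_number, row in enumerate(reader, start=1):
        for column, spec in DATA_DICTIONARY.items():
            if column in missing:
                continue
            raw = row[column]
            try:
                value = spec["parse"](raw)
            except (ValueError, TypeError):
                problems.append((row_number, column, f"cannot parse {raw!r}"))
                continue
            if not spec["valid"](value):
                problems.append((row_number, column, f"out of range: {raw!r}"))
    return problems


# Made-up example data with two deliberate mistakes.
example = """ring_number,clutch_size,lay_date
AB12345,8,2019-05-02
AB12346,eight,2019-05-04
AB12347,45,2019-04-30
"""

for problem in check_rows(example):
    print(problem)
```

The point is less the code itself than the habit: if the meaning, unit and plausible range of each variable are written down next to the data, a student (or a future you) can check a file mechanically instead of guessing.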

Data management does not only entail the creation and processing of data; it also includes sharing and reusing data by the scientific community. It has become increasingly common to be asked to share the data used in a scientific paper. Online repositories such as Dryad – a community-led platform that is committed to making data available for research and educational reuse – or code-sharing platforms like GitHub are often used, but the available data is often a mere summary of the actual data used. It is not so surprising: imagine being a researcher responsible for the long-term individual-level monitoring of a species that is very dear to them. It can be very frightening to make years and years of commitment and valuable information available to the public, as it means that other researchers can incorporate that data into their own papers, even before you’ve had a chance to publish your own research. Sharing data can, however, be very valuable for the visibility and influence of the owner’s research, encourage collaborations and new research ideas, and improve transparency – a theme of increasing importance in the Open Access movement.

Community Standards and Initiatives

The challenges described above become even clearer when one integrates data from different sources. Inconsistencies and errors accumulate, and the many different formats and data structures make the conversion of these data into a usable format difficult and time-consuming. Luckily, there are some initiatives out there that recognise the problems with data management.

Community data standards are one way to tackle the infinite number of formats. Community data standards are, as the name implies, data formatting standards commonly used by a community. One of the most widely used data standards is Darwin Core, a standard that offers a clear and flexible framework for compiling biodiversity data using a glossary of terms, but there are numerous data standards tailored for specific research fields (e.g., Open Traits Network, a community of researchers and institutions working towards the standardisation and integration of trait data, and SPI-Birds, a network and database with a community-defined, standardized method for formatting data on hole-nesting birds).
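As a rough illustration of what adopting a standard can mean in practice, the Python sketch below renames the columns of one lab's field record to Darwin Core terms. The terms on the right-hand side of the mapping (`scientificName`, `eventDate`, `decimalLatitude`, `decimalLongitude`, `recordedBy`, `individualCount`) are Darwin Core terms; the column names on the left, the `to_darwin_core` helper and the example record are made up for this sketch.

```python
# Hypothetical mapping from one lab's own column names (left, invented)
# to Darwin Core terms (right).
FIELD_TO_DARWIN_CORE = {
    "species": "scientificName",
    "date_seen": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "observer": "recordedBy",
    "n": "individualCount",
}


def to_darwin_core(record):
    """Rename known keys to Darwin Core terms; set unknown keys aside."""
    converted, unmapped = {}, {}
    for key, value in record.items():
        if key in FIELD_TO_DARWIN_CORE:
            converted[FIELD_TO_DARWIN_CORE[key]] = value
        else:
            unmapped[key] = value
    return converted, unmapped


# Made-up observation of a house sparrow.
field_record = {
    "species": "Passer domesticus",
    "date_seen": "2020-04-17",
    "lat": 63.4305,
    "lon": 10.3951,
    "observer": "A. Observer",
    "n": 3,
    "notes": "feeding on the ground",
}

converted, unmapped = to_darwin_core(field_record)
print(converted)
print(unmapped)  # columns that still need a decision
```

Renaming columns is only the first step. A standard also says something about the values themselves (dates in ISO 8601 format, for example), and the columns that end up in `unmapped` are exactly the small, seemingly insignificant decisions that someone still has to document.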

Whilst the ubiquity of the house sparrow means there is plenty of data on it, that data can be a nightmare to bring together (Image Credit: TK McLean, Pixabay licence)

Progress towards integration of data from different sources has also been made through databases and initiatives such as the Global Biodiversity Information Facility (GBIF), an international network and research infrastructure with the aim to provide open access to biodiversity data, GenBank, a database of all publicly available DNA sequences, and FORCE11. Using the FAIR principles, this community of researchers, librarians, publishers and funding agencies intends to provide guidelines to improve the findability, accessibility, interoperability (i.e., the ability to integrate with other data sources) and reusability of data and other digital research objects.

Biodiversity is facing unprecedented challenges like climate change, invasive species and habitat loss. To better understand the consequences of these pressures on biodiversity, data from different disciplines need to be integrated, which is only possible if individual datasets are well-managed, interoperable and publicly available.

To find out more about modern data management challenges, read our interview with GBIF’s Head of Informatics Tim Robertson, linked below.

Tim Robertson: The World of Ecological Data

Stefan Vriend is a population ecologist working as a PhD student at the Norwegian University of Science and Technology. Through his work on the spatial variation of hole-nesting bird demography, life history and phenotypic selection he got involved in the SPI-Birds Network and Database. You can read more about his research here, read more of his articles on Ecology for the Masses here or follow him on Twitter here.

The Changing Face of Ecology: Part Four

This installment includes thoughts from (left to right) Dag Hessen, Erica McAlister, Rasmus Hansson and Prue Addison (Image Credits: Dag Hessen, University of Oslo; Erica McAlister, CC BY-SA 2.0; Miljøpartiet de Grønne, CC BY-SA 2.0; Synchronicity Earth, CC BY 2.0)

Running EcoMass means we get to sit down with some exceptionally interesting ecologists, conservationists, and in this post, even environmental politicians. Most of these individuals have been a part of the discipline for much longer than we have, so when we get the chance we pick their brains about how ecology has changed over the past decades. It’s always interesting to hear which aspects of ecological life we take for granted simply weren’t there 40, 30 or even 10 years ago.

You can also check out parts one (link), two (link) and three (link) of our Changing Face of Ecology specials, and click on the names below to read our full interviews with each of this issue’s respondents.

Read more

Why Warmer Winters Don’t Always Help Geese

Contrasting consequences of climate change for migratory geese: Predation, density dependence and carryover effects offset benefits of high-arctic warming (2019) Layton-Matthews et al., Global Change Biology, DOI: 10.1111/gcb.14773

The Crux

Most of us know that climate change will bring warmer, shorter winters to most parts of the world. For many species in areas like the Arctic, it would be easy to interpret this as a good thing – plants grow earlier, so animals get more food, right? Naturally it’s never that simple. Many herbivorous species have evolved in sync with climate cycles so that their reproduction peaks when food becomes available. If season start dates change, these species may not be able to change their own cycles in time. Additionally, what happens if populations of their predators suddenly boom?

Today’s authors wanted to know what role a warming climate played in the population fluctuations of migratory barnacle geese (Branta leucopsis).

Read more

Tim Robertson: The World of Ecological Data

Image Credit: GBIF, CC BY 4.0, Image Cropped

When I was a child, I’d often study books of Australian birds and mammals, rifling through the pages to see which species lived nearby. My sources of information were the maps printed next to photos of the species, distribution maps showing the extent of each species’ range. These days, many of these species’ ranges are declining. Or at least, many ecologists believe they are. One of the problems with knowing exactly where species exist or how they are faring is a lack of data. The more data we have, the more precise an idea we get of the future of the species. Some data is difficult to collect, but yet more data has been collected, and is simply inaccessible.

At the Living Norway seminar earlier this month I sat down with Tim Robertson, Head of Informatics at the Global Biodiversity Information Facility. GBIF is an international network that works to solve this data problem worldwide, both by making collected data accessible and by helping everyday people to collect scientific data. I spoke with Tim about the journey from a species observation to a species distribution map, the role of GBIF, and the future of data collection.

Read more

Modernising Ecological Data Management: Reflections from the Living Norway Seminar

Ecological data is constantly being collected worldwide, but how accessible is it? (Image Credit: GBIF, CC BY 4.0, Image Cropped)

This week Trondheim played host to Living Norway, a Norwegian collective that aims to promote FAIR data use and management. It might sound dry from an ecological perspective, but I was told I’d see my supervisor wearing a suit jacket, an opportunity too preposterous to miss. While the latter opportunity was certainly a highlight, the seminar itself proved fascinating, and underlined just how important FAIR data is for ecology, and science in general. So why is it so important, what can we do to help, and why do I keep capitalising FAIR?

Read more

Bigger is Not Better

Not all GPS coordinate data are created equal, and some of it may actually be meaningless. (Image Credit: Daniel Johansson, Pexels licence, Image Cropped)

The smartphone fallacy – when spatial data are reported at spatial scales finer than the organisms themselves (2018) Meiri, S., Frontiers of Biogeography, DOI: https://escholarship.org/uc/item/2n3349jg

The Crux

One of the greatest annoyances when using museum specimens, old datasets, or large occurrence databases (such as GBIF) is when the locality of an occurrence is only vaguely described, and the coordinate uncertainty is high; “Norway” or “Indochina” doesn’t really tell you much about where that specific animal or plant was seen. Luckily, the days when such vague descriptions were the best you could get are long gone, as most of us now walk around with a GPS in our pockets, and even community science data can be reported very accurately, and more or less in real time.

However, we have now encountered the opposite problem: the reported coordinates of organisms are often too precise to be realistic, and in the worst-case scenario, they might be borderline meaningless. The author of this study wanted to highlight how this advance in technology, coupled with our eagerness to get more accurate data and results, has made us too bold in our positional claims.
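To get a feel for what “too precise” means, the back-of-the-envelope Python sketch below converts the number of decimal places in a reported coordinate into an approximate distance on the ground. It is not the analysis from the paper: the `implied_precision_m` helper is hypothetical, and it assumes a spherical Earth with roughly 111 km per degree of latitude.

```python
import math

# Approximate length of one degree of latitude, treating the Earth as a
# sphere (an assumption; the true value varies slightly with latitude).
METRES_PER_DEGREE_LATITUDE = 111_000


def implied_precision_m(decimal_places, latitude_deg=0.0):
    """Approximate ground distance represented by the last reported decimal.

    Returns (north_south, east_west) in metres. East-west distances shrink
    towards the poles, hence the cosine of the latitude.
    """
    step_deg = 10 ** -decimal_places
    north_south = step_deg * METRES_PER_DEGREE_LATITUDE
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# How much ground does each extra decimal place claim to resolve at 60°N?
for places in range(1, 7):
    ns, ew = implied_precision_m(places, latitude_deg=60.0)
    print(f"{places} decimals: ~{ns:,.2f} m north-south, ~{ew:,.2f} m east-west")
```

Under these assumptions, five decimal places already imply a position known to about a metre, which is finer than the body length of many of the animals being recorded, let alone the area they roam over.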

Read more
