The Modern Biologist’s Challenge: Data Management
Modern biologists often do much of their most important work not deep in a forest, but sitting behind a laptop while fuelling their caffeine addictions (Image Credit: gdsteam, CC BY 2.0, Image Cropped)
Ask people to picture a biologist, and chances are that many will picture someone like Jane Goodall or David Attenborough: a determined scientist wearing zip-off pants and a pair of sturdy boots, making their way through the thick vegetation of a remote Pacific island to study the intricate social behaviour of an elusive ground-dwelling mammal. Yet these days a large portion of modern biologists embark on very different journeys. Equipped with a computer full of code and mathematical models, they venture through a jungle of spreadsheets and tables filled with row upon row of data.
First of all, some nuance is needed. I might fit the picture of the biologist who only leaves their office to refill their coffee mug or cool down after another computer meltdown, but the majority of biologists do fit the above description of the ‘traditional biologist’ to varying degrees. They might spend time out in the field, growing plants in greenhouses or cultivating microorganisms in the lab. But nowadays they’re almost all spending some time wrangling, analyzing and visualizing data behind their computers. And as this type of scientist has slowly become the norm, the amount of biological data floating around has grown exponentially. And this comes with a whole new set of challenges.
The Challenges of Data Management
Good data management is fundamental to producing high-quality research. It starts with the creation and collection of data. Even if the process involves clear protocols, calibrated measuring devices and well-trained volunteers, students or researchers, the many people who are often involved in data collection will introduce errors and biases. Identifying potential sources of error and bias and documenting them explicitly makes it possible to account for them at a later stage, yet in practice this is often hard to do.
After collection, data are digitized and converted into a format suitable for subsequent analyses. During this process, a researcher, often with a particular study or research project in mind, makes any number of small, seemingly insignificant decisions that determine how the data are structured. The number of files to store the data in, the variable names and the data types might be logical to the researcher who processed the data, but might not be so obvious to their student. Metadata or similar files and quality checks are often missing, which makes it difficult to work out how to interpret the content of the data. Choosing a consistent, intuitive format that is also usable in future work is not easy. As biologists are rarely trained in data management, the typical dataset may be a database manager’s worst nightmare: unorganized, inaccurate and inefficient.
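To make the idea of metadata and quality checks a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the column names (nest_id, clutch_size, lay_date), the allowed range and the example rows do not come from any real dataset. It shows a small data dictionary that documents each column, plus a check that flags missing, non-numeric or implausible values.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column, documenting what the
# column means and, for numeric columns, which values are plausible.
DATA_DICTIONARY = {
    "nest_id": {"type": str, "description": "Unique nest box identifier"},
    "clutch_size": {"type": int, "description": "Number of eggs", "min": 0, "max": 20},
    "lay_date": {"type": str, "description": "Date of first egg (YYYY-MM-DD)"},
}


def check_rows(rows, dictionary):
    """Return a list of (row_number, column, problem) tuples."""
    problems = []
    for i, row in enumerate(rows, start=1):
        for column, spec in dictionary.items():
            value = row.get(column)
            if value is None or value == "":
                problems.append((i, column, "missing value"))
                continue
            if spec["type"] is int:
                try:
                    number = int(value)
                except ValueError:
                    problems.append((i, column, f"not an integer: {value!r}"))
                    continue
                low = spec.get("min", float("-inf"))
                high = spec.get("max", float("inf"))
                if not low <= number <= high:
                    problems.append((i, column, f"out of range: {number}"))
    return problems


# Invented example data with three deliberate problems.
EXAMPLE_CSV = """nest_id,clutch_size,lay_date
A1,8,2019-04-21
A2,,2019-04-25
A3,eight,2019-04-23
A4,80,2019-04-30
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
for problem in check_rows(rows, DATA_DICTIONARY):
    print(problem)
```

Keeping the dictionary next to the data means the same file documents the columns for a future reader and drives the checks, so the two are less likely to drift apart.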
Data management does not only entail the creation and processing of data; it also includes the sharing and reuse of data by the scientific community. It has become increasingly common to be asked to share the data used in a scientific paper. Online repositories such as Dryad – a community-led platform that is committed to making data available for research and educational reuse – or code-sharing platforms like GitHub are often used, but the available data are often a mere summary of the actual data used. This is not so surprising: imagine being a researcher responsible for the long-term individual-level monitoring of a species that is very dear to them. It can be very frightening to make years and years of commitment and valuable information available to the public, as it means that other researchers can incorporate that data into their own papers, even before you’ve had a chance to publish your own research. Sharing data can, however, be very valuable for the visibility and influence of the owner’s research, encourage collaborations and new research ideas, and improve transparency – a theme of increasing importance in the Open Access movement.
Community Standards and Initiatives
The challenges described above become even clearer when one integrates data from different sources. Inconsistencies and errors accumulate, and the many different formats and data structures make the conversion of these data into a usable format difficult and time-consuming. Luckily, there are some initiatives out there that recognise the problems with data management.
Community data standards are one way to tackle the seemingly infinite number of formats. They are, as the name implies, data formatting standards commonly used by a community. One of the most widely used data standards is Darwin Core, a standard that offers a clear and flexible framework for compiling biodiversity data using a glossary of terms, but there are numerous data standards tailored for specific research fields (e.g., Open Traits Network, a community of researchers and institutions working towards the standardisation and integration of trait data, and SPI-Birds, a network and database with a community-defined, standardized method for formatting data on hole-nesting birds).
Progress towards integration of data from different sources has also been made through databases and initiatives such as the Global Biodiversity Information Facility (GBIF), an international network and research infrastructure that aims to provide open access to biodiversity data; GenBank, a database of all publicly available DNA sequences; and FORCE11. Using the FAIR principles, this community of researchers, librarians, publishers and funding agencies intends to provide guidelines to improve the findability, accessibility, interoperability (i.e., the ability to integrate with other data sources) and reusability of data and other digital research objects.
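As a rough illustration of what adopting a glossary of terms can look like in practice, the sketch below renames the columns of a single field record to Darwin Core terms. The Darwin Core term names on the right-hand side (scientificName, eventDate, decimalLatitude, decimalLongitude, individualCount) are real terms from the standard; the local column names, the record itself and the helper function are hypothetical. A full Darwin Core dataset needs more than renaming, so treat this as a first step only.

```python
# Hypothetical local column names mapped to Darwin Core terms.
LOCAL_TO_DARWIN_CORE = {
    "species": "scientificName",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "count": "individualCount",
}


def to_darwin_core(record, mapping=LOCAL_TO_DARWIN_CORE):
    """Rename known columns and report the ones without a mapping."""
    renamed = {}
    unmapped = []
    for key, value in record.items():
        if key in mapping:
            renamed[mapping[key]] = value
        else:
            # Keep track of columns that still need a decision.
            unmapped.append(key)
    return renamed, unmapped


# Invented example record, for illustration only.
field_record = {
    "species": "Parus major",
    "date": "2019-05-02",
    "lat": 63.42,
    "lon": 10.39,
    "count": 2,
    "notes": "nest box 12",
}

renamed, unmapped = to_darwin_core(field_record)
print(renamed)
print("No Darwin Core mapping yet for:", unmapped)
```

Reporting the unmapped columns instead of silently dropping them is a deliberate choice here: it forces an explicit decision about every piece of information that does not yet fit the standard.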
Biodiversity is facing unprecedented challenges like climate change, invasive species and habitat loss. To better understand the consequences of these pressures on biodiversity, data from different disciplines need to be integrated, which is only possible if individual datasets are well-managed, interoperable and publicly available.
To find out more about modern data management challenges, read our interview with GBIF’s Head of Informatics Tim Robertson, linked below.
Stefan Vriend is a population ecologist working as a PhD student at the Norwegian University of Science and Technology. Through his work on the spatial variation of hole-nesting bird demography, life history and phenotypic selection he got involved in the SPI-Birds Network and Database. You can read more about his research here, read more of his articles on Ecology for the Masses here or follow him on Twitter here.