The raw material of science is data. Hypotheses are built and conclusions reached from data, hence the importance of ensuring they are of the highest possible quality. To optimize the records used in environmental biology, Cristina Ronquillo from MNCN-CSIC has developed the OCCUR Shiny application, which helps users learn about and set criteria for filtering, homogenizing, and validating the data used by teams working on topics ranging from species distribution to predictions about the impact of ongoing global change on biodiversity. The paper presenting it is now available in Methods in Ecology and Evolution.

Biodiversity occurrence records are data about the distribution of animal and plant species, each documenting the observation of a specimen, or of traces it left, in a specific place and time. In addition to the location, these records may include environmental data or specific measurements of the specimen. The volume of this information, gathered by institutions, research personnel, and even individuals participating in citizen science projects, has grown exponentially in recent decades. The data are stored in open-access repositories or data portals and are then used for scientific studies in various fields. The most important repository worldwide is the Global Biodiversity Information Facility (GBIF), which has nearly 3 billion records available. As Cristina explains, thanks to these massive data repositories, we can better assess certain aspects of the state of biodiversity while reducing the costs and resources involved in sampling. However, to use these resources correctly, it is necessary to take certain measures, because the criteria of each collector are different. That was the main motivation for developing OCCUR, which in fact stems from a comprehensive review of the methods and protocols currently available to clean, handle, and manage species occurrence data. Our interest is to promote a series of good practices and protocols in data cleaning and preparation, just as would be done in any other phase of a scientific analysis.

When using data from large repositories and biodiversity information networks, it is essential to consider the limitations of the records due to the biases and inaccuracies they include, as well as their standardization and harmonization. For example, it matters whether all records are identified at the species level, whether they include the author of the scientific name, whether the measurements are standardized, and whether the record has been georeferenced with the coordinates of the sampling location or only those of a nearby city. It is also important to review the criteria for downloading or selecting certain records based on the study. Indeed, in our daily work with ecologists, we discovered that many of the criteria and tools that could be applied to this information were not widely known. In this sense, it is important to focus the data cleaning process on selecting the records that are useful for answering the question we have posed, rather than on obtaining the best data.

The OCCUR application provides the scientific community with an easy-to-use tool that helps researchers understand the possible pathways for handling data and assess which records to include based on the quantity and quality of those available. The application synthesizes the criteria and methods for processing records proposed by 25 previous works, grouping them into five modules: type and nature of the record, taxonomy, geography, temporal information, and detection of duplicates. At the end, OCCUR generates a report that outlines the steps selected in each case, facilitating the development of analyses and the inclusion, organization, and writing of methods in the scientific article describing the study in question. For those steps where it is possible, OCCUR also provides code in the R statistical language that users can include in their own analyses.
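To make the five-module idea concrete, here is a minimal sketch in Python of what such a curation pipeline could look like. It is not OCCUR's code (OCCUR itself provides R code), and everything in it is an assumption made for illustration: the Record fields, the accepted record types, the 1950 year threshold, the duplicate key, and the example moss records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical, simplified fields; real occurrence records carry many more.
    basis_of_record: str
    scientific_name: str
    taxon_rank: str
    latitude: Optional[float]
    longitude: Optional[float]
    year: Optional[int]


def has_valid_coordinates(r: Record) -> bool:
    """Geography check: coordinates present, in range, and not the (0, 0) placeholder."""
    if r.latitude is None or r.longitude is None:
        return False
    if r.latitude == 0 and r.longitude == 0:
        return False
    return -90 <= r.latitude <= 90 and -180 <= r.longitude <= 180


def curate(records, min_year=1950):
    """Apply one illustrative filter per module and log how many records survive each."""
    accepted_types = {"PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"}  # assumed choice
    steps = [
        ("record type", lambda r: r.basis_of_record in accepted_types),
        ("taxonomy", lambda r: r.taxon_rank == "SPECIES"),
        ("geography", has_valid_coordinates),
        ("temporal", lambda r: r.year is not None and r.year >= min_year),
    ]
    kept = list(records)
    report = [("input", len(kept))]
    for name, keep in steps:
        kept = [r for r in kept if keep(r)]
        report.append((name, len(kept)))

    # Duplicates: keep the first record for each (name, coordinates, year) key.
    seen = set()
    unique = []
    for r in kept:
        key = (r.scientific_name, r.latitude, r.longitude, r.year)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    report.append(("duplicates", len(unique)))
    return unique, report


# Made-up example records, one failing each module.
EXAMPLE_RECORDS = [
    Record("PRESERVED_SPECIMEN", "Bryum argenteum", "SPECIES", 40.4, -3.7, 1998),
    Record("PRESERVED_SPECIMEN", "Bryum argenteum", "SPECIES", 40.4, -3.7, 1998),
    Record("FOSSIL_SPECIMEN", "Bryum argenteum", "SPECIES", 41.0, 2.1, 2001),
    Record("HUMAN_OBSERVATION", "Bryum", "GENUS", 48.8, 2.3, 2010),
    Record("HUMAN_OBSERVATION", "Funaria hygrometrica", "SPECIES", 0.0, 0.0, 2015),
    Record("HUMAN_OBSERVATION", "Funaria hygrometrica", "SPECIES", 52.5, 13.4, None),
    Record("HUMAN_OBSERVATION", "Funaria hygrometrica", "SPECIES", 52.5, 13.4, 2019),
]

clean, report = curate(EXAMPLE_RECORDS)
for step, remaining in report:
    print(f"{step:12s} {remaining} records remaining")
```

The per-step log plays the role of a methods report: it records which criteria were applied and how many records each one removed, so the same choices can be described in a paper and repeated by others. Which filters to apply, and how strictly, depends on the question being asked.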

We assessed the utility of OCCUR in several recent studies, including one published in Ecology and Evolution, which analyzed over 9 million moss records available for the temperate region of the Northern Hemisphere. The results of this work show that different methods of processing records led to notable differences in the diversity of species observed in certain areas of Europe and North America, along with variations in the relationships between climate and biodiversity measured from these massive data sets. This has important consequences, as it implies that the quality of the starting data we use to calibrate models of the impact of global change can alter their predictions. It underscores the need for meticulous processing of massive biodiversity data, done in a way that other researchers can replicate in the future.

You can access the paper presenting OCCUR at https://jhortal.com/project/ronquillo-et-al-2024-occur-shiny-application-for-curating-occurrence-records/

And the study about Northern Hemisphere mosses at https://jhortal.com/project/ronquillo-et-al-ecol-evol-2023-impact-of-data-curation-on-the-observed-distribution-of-mosses/