
Topics and Themes

Integration of environmental and ecological modelling

Statistical methods for ecological data and for environmental data have largely evolved independently of one another. The climate crisis, and its knock-on consequences for the biosphere, provides a strong argument for this to change. What could an integrated environmental-ecological statistical modelling framework look like, and how achievable might it be?

Data fusion

Data on the cryosphere, biosphere, hydrosphere and atmosphere are no longer restricted to measurements obtained from in-situ stations or field studies. Technological advances in remote sensing and scientific advances in numerical modelling have provided us with large, high-resolution spatio-temporal data sets, including satellite data and output from reanalysis and forecasting numerical models. What are the advantages of such data sets, and what are their limitations? Should this information be combined with in-situ observations, and if so, how can this be done in a principled statistical manner?

Across modelling paradigms

AI and machine learning algorithms appear to unravel large, complex data sets with ease, providing a tempting approach for environmental scientists faced with data of the type described above. Similarly, advanced numerical models have been developed to describe many natural phenomena. Where do statistical models fit? Are hybrid statistical-ML or statistical-numerical models feasible or desirable? Should statistical modelling remain a tool predominantly for smaller, observational data sets?

From modelling to policy

It could be argued that all science, including statistical modelling, is most useful when it is used to change human behaviour. Organisations of all sizes, across all sectors, are starting to respond to the climate and nature emergency by identifying mitigation and adaptation mechanisms. Should statistical models be integral to this decision-making? If so, how can we ensure that this happens? Is there a right way for information to be communicated?

Speakers

Fergus Chadwick, Biomathematics and Statistics Scotland

Emily Dennis, Butterfly Conservation

Marc Genton, King Abdullah University of Science and Technology

Wei Zhang, University of Glasgow

Abstracts

Marc Genton: Exascale Geostatistics for Environmental Data Science

Environmental data science relies on solving some fundamental problems, such as: 1) spatial Gaussian likelihood inference; 2) spatial kriging; 3) Gaussian random field simulations; 4) multivariate Gaussian probabilities; and 5) robust inference for spatial data. These problems become very challenging when the number of spatial locations grows large.

Moreover, they are the cornerstone of more sophisticated procedures involving non-Gaussian distributions, multivariate random fields, or space-time processes. Parallel computing becomes necessary for avoiding computational and memory restrictions associated with large-scale environmental data science applications.
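As a rough illustration of the first of these problems (a minimal sketch in base R with placeholder data, not ExaGeoStat itself), the exact Gaussian log-likelihood below requires forming and factorising a dense n x n covariance matrix, which costs O(n^3) flops and O(n^2) memory and quickly becomes infeasible as n grows:

    # Minimal sketch of problem 1: exact Gaussian log-likelihood for n irregularly
    # spaced observations under an exponential covariance. The dense Cholesky
    # factorisation is the O(n^3) bottleneck that tile-based and low-rank methods
    # aim to remove. All data below are placeholders.
    set.seed(1)
    n      <- 500                               # manageable here; hopeless at n ~ 10^6
    coords <- matrix(runif(2 * n), ncol = 2)    # random locations on the unit square
    z      <- rnorm(n)                          # placeholder observations

    gauss_loglik <- function(theta, coords, z) {
      sigma2 <- theta[1]; range <- theta[2]; nugget <- theta[3]
      Sigma <- sigma2 * exp(-as.matrix(dist(coords)) / range) + diag(nugget, length(z))
      L     <- chol(Sigma)                      # the O(n^3) step
      alpha <- backsolve(L, forwardsolve(t(L), z))
      -0.5 * (length(z) * log(2 * pi) + 2 * sum(log(diag(L))) + sum(z * alpha))
    }

    gauss_loglik(c(1, 0.2, 0.1), coords, z)
    # Maximising this over theta (e.g. with optim) is the inference task that
    # software such as ExaGeoStat parallelises at scale.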

In this talk, I will explain how high-performance computing can provide solutions to the aforementioned problems using tile-based linear algebra, tile low-rank approximations, and multi- and mixed-precision computational statistics. I will introduce ExaGeoStat, and its R version ExaGeoStatR, powerful software that can perform exascale (10^18 flops/s) geostatistics by exploiting existing parallel computing hardware, such as shared-memory systems (possibly equipped with GPUs) and distributed-memory systems, i.e., supercomputers. I will then describe how ExaGeoStat can be used to design competitions on spatial statistics for large datasets and to benchmark new methods developed by statisticians and data scientists for large-scale environmental data science.

Wei Zhang: Analysis of batch-mark data using latent multinomial models

Batch marking is common and useful for many capture–recapture studies where individual marks cannot be applied due to various constraints such as timing, cost, or marking difficulty. When batch marks are used, the observed data are not individual capture histories but a set of counts on each capture occasion: the number of individuals marked for the first time, the number of marked individuals recaptured, and (in some studies) the number of individuals captured but released unmarked.

Fitting traditional capture–recapture models to such data requires one to identify all possible sets of capture–recapture histories that may lead to the observed data, which is computationally infeasible even for a small number of capture occasions. In this talk, I will introduce a latent multinomial model to deal with such data, where the observed vector of counts is a non-invertible linear transformation of a latent vector that follows a multinomial distribution depending on model parameters. The latent multinomial model can be fitted efficiently through a maximum likelihood approach based on a saddlepoint approximation. The model framework is very flexible and can be applied to data collected under different study designs. Simulation studies indicate that reliable estimates are obtained for all parameters of the proposed model. We apply the model to the analysis of golden mantella data collected using batch marks in central Madagascar.
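To make the latent-multinomial structure concrete, the toy R sketch below (an illustration only, not the speaker's model or data) builds the observed batch-mark counts as a non-invertible linear transformation of a latent multinomial vector of capture histories over three occasions:

    # Toy illustration of the latent-multinomial structure described above (not the
    # speaker's batch-mark model). A latent vector n of counts over capture
    # histories follows a multinomial distribution; only y = A %*% n is observed,
    # and A is not invertible, so many latent configurations are consistent with y.
    set.seed(1)
    histories <- c("100", "110", "101", "111", "010", "011", "001")  # T = 3 occasions
    p <- rep(1 / length(histories), length(histories))               # placeholder cell probabilities
    N <- 200
    n_latent <- as.vector(rmultinom(1, size = N, prob = p))          # latent multinomial counts

    # A maps full histories to the observed summaries: numbers first marked on
    # each occasion (rows 1-3) and numbers recaptured on occasions 2-3 (rows 4-5).
    first_cap <- sapply(histories, function(h) regexpr("1", h)[1])
    A <- rbind(
      t(sapply(1:3, function(t) as.numeric(first_cap == t))),
      t(sapply(2:3, function(t) as.numeric(first_cap < t &
                                           substr(histories, t, t) == "1")))
    )
    y <- as.vector(A %*% n_latent)   # observed counts; A is 5 x 7, hence non-invertible
    y
    # Likelihood evaluation must account for all latent n with A %*% n == y, a sum
    # the saddlepoint-based approach avoids enumerating explicitly.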

Sara Martino: Spatial Occupancy models and INLA

Occupancy models are a common approach for assessing species distribution patterns. Their primary feature is the ability to model imperfect detection, which accounts for the failure to observe a species during sampling when it is, in fact, present. At present, detection–nondetection data sources are steadily increasing in both spatial extent and number of observed locations, making it increasingly important to account for spatial autocorrelation.

Modelling spatial dependence through spatially structured random effects can enhance predictive performance for occurrence probabilities across the region of interest. However, incorporating such random effects is known to be computationally expensive and can lead to intolerably long software run times when dealing with a moderately large number of locations (e.g., 100s to 1,000s of locations).

In this presentation, we demonstrate how occupancy models can be integrated into the framework of latent Gaussian models, thereby leveraging the computational capabilities of Integrated Nested Laplace Approximation (INLA). INLA enables the efficient fitting of models with random effects. These random effects are not limited to spatial structure but can also represent time, non-linear effects of covariates, space-time interactions, and more. This approach not only allows us to fit occupancy models efficiently but also makes them more flexible. We illustrate our findings using simulations and a case study.
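For readers unfamiliar with the latent Gaussian framework, the hedged R sketch below (assuming the R-INLA package and simulated placeholder data) fits a simple Bernoulli model with an SPDE Matern spatial random effect to detection/nondetection data; note that it ignores imperfect detection, which is precisely what the talk addresses:

    # Minimal latent Gaussian model sketch with R-INLA: Bernoulli detections with a
    # spatially structured (SPDE Matern) random effect. Not the speakers' occupancy
    # formulation; all data are simulated placeholders.
    library(INLA)

    set.seed(1)
    n      <- 400
    coords <- matrix(runif(2 * n), ncol = 2)          # site locations
    y      <- rbinom(n, 1, 0.4)                       # placeholder detections
    covar  <- rnorm(n)                                # placeholder covariate

    mesh <- inla.mesh.2d(loc = coords, max.edge = c(0.1, 0.3), cutoff = 0.02)
    spde <- inla.spde2.pcmatern(mesh,
                                prior.range = c(0.2, 0.5),   # P(range < 0.2) = 0.5
                                prior.sigma = c(1, 0.01))    # P(sigma > 1) = 0.01
    A    <- inla.spde.make.A(mesh, loc = coords)

    stk <- inla.stack(data    = list(y = y),
                      A       = list(A, 1),
                      effects = list(spatial = 1:spde$n.spde,
                                     data.frame(intercept = 1, covar = covar)))

    fit <- inla(y ~ 0 + intercept + covar + f(spatial, model = spde),
                family = "binomial",
                data = inla.stack.data(stk),
                control.predictor = list(A = inla.stack.A(stk)))
    summary(fit)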

Emily Dennis: Big data for small creatures - estimating butterfly population trends from citizen science data

Robust measures of change are vital for providing reliable evidence of global insect decline. Extensive, long-running data sets on species’ abundance and distribution are available for UK butterflies. However, the path from raw data to functional outputs, such as individual species’ trends and multi-species indicators, is evolving, with new statistical approaches and challenges presented by non-standardised sampling. This talk will outline the statistical methods used for producing UK butterfly trends and explore future opportunities and ongoing challenges.

An increasing variety of data sources, particularly from citizen science, presents opportunities but often requires new statistical methods. One example is the UK Big Butterfly Count (BBC), which commenced in 2010 and attracts participation from people who may have little or no experience of biodiversity monitoring. During a 3-week sampling period this summer, 1.6 million butterflies were counted by over 100,000 citizen scientists.

BBC data offer the potential to produce trends for habitats that are under-sampled by traditional transect monitoring, such as gardens and urban areas. However, the short BBC sampling season renders counts susceptible to bias caused by interannual variation in the timing of species’ flight periods. I will describe a new method that corrects for this bias using flight period estimates from standardised monitoring data. Suitable methods are needed for a range of data sources to ultimately provide a strong evidence base for policy development, management, and conservation.
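A hedged sketch of the general idea (not the speaker's exact method): estimate a species' flight curve from standardised weekly counts with a GAM in mgcv, then rescale a short-window count by the estimated fraction of the flight period falling inside that year's three-week window. The objects ukbms, bbc_window and bbc_count below are hypothetical placeholders:

    # Illustrative phenology correction, assuming 'ukbms' holds standardised counts
    # with columns count and doy (day of year), 'bbc_window' gives the start/end
    # day of that year's BBC window, and 'bbc_count' is the raw BBC total.
    library(mgcv)

    phen <- gam(count ~ s(doy, k = 20), family = nb(), data = ukbms, method = "REML")

    doy_grid <- 1:365
    flight   <- predict(phen, newdata = data.frame(doy = doy_grid), type = "response")
    prop_in_window <- sum(flight[doy_grid >= bbc_window$start &
                                 doy_grid <= bbc_window$end]) / sum(flight)

    corrected_bbc_count <- bbc_count / prop_in_window   # up-weight a count taken
                                                        # outside the flight peak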

Ruth O’Donnell: Using novel data streams and statistical models to unlock insights into freshwater environments

In an era of rapidly expanding environmental data, it is crucial to fully harness this wealth of information in order to understand the state and dynamics of our natural resources over time and space. This presentation offers an overview of statistical approaches which we have developed to extract the maximum insight from data on freshwater environments.

We will consider the application of functional data analysis to in-situ reflectance data, defining optical water types that enhance the processing of earth observation data. Subsequently, we explore how this processed earth observation data aids in identifying local and global spatial patterns in lake water quality. Lastly, we introduce our new project MOT4, which will explore the monitoring and modelling of river water quality across UK catchments, employing cutting-edge data analysis techniques and combining a range of both environmental and ecological measurements.
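As a rough illustration of the functional data analysis step (my sketch, not the project's code), the R snippet below represents each reflectance spectrum as a smooth function in a B-spline basis using the fda package and clusters the basis coefficients into candidate optical water types:

    # Illustrative sketch only: smooth in-situ reflectance spectra as functional
    # data, then cluster the coefficient vectors. All spectra here are simulated
    # placeholders.
    library(fda)

    wavelengths <- seq(400, 800, by = 5)                       # nm
    refl <- replicate(60, 0.02 + 0.01 * sin(wavelengths / 50) +
                          rnorm(length(wavelengths), sd = 0.002))

    basis <- create.bspline.basis(rangeval = range(wavelengths), nbasis = 25)
    fdobj <- Data2fd(argvals = wavelengths, y = refl, basisobj = basis)

    # Cluster the coefficient vectors; each cluster is a candidate optical water type
    owt <- kmeans(t(fdobj$coefs), centers = 4, nstart = 20)
    table(owt$cluster)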

Dave Miller: Yes! You can do that in mgcv

There are lots of fancy ways to write complex ecological models in R (or via various extension languages) but sometimes you can get a lot done using software that already exists.

In this talk I'll explore Jenny Bryan's quote "if all models are wrong, then why not start with one that you understand?" and chat through how to construct some more complicated models using the generalized additive modelling workhorse mgcv.
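As a flavour of what "more complicated models in mgcv" can look like (a sketch under assumed data, not the speaker's example), the call below fits a spatio-temporal count model with a space-time tensor-product smooth, a site-level random effect and a negative binomial response; dat is a hypothetical data frame with columns count, lon, lat, year and site:

    # Spatio-temporal count model fitted entirely with mgcv.
    library(mgcv)

    fit <- gam(count ~ te(lon, lat, year,
                          d  = c(2, 1),              # 2-d space x 1-d time
                          bs = c("tp", "cr"),
                          k  = c(40, 8)) +
                       s(site, bs = "re"),           # random intercept per site
               family = nb(),                        # negative binomial counts
               data   = dat,
               method = "REML")

    summary(fit)
    plot(fit, scheme = 2, pages = 1)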

Lily Gouldsbrough: Improving Air Pollution Forecasts with Machine Learning-Based Post-Processing

Reliable air pollution forecasts play a crucial role in alerting vulnerable individuals to elevated pollution levels and warning the general public about potentially harmful air pollution episodes. In the context of UK air pollution forecasts, the current practice involves utilising an atmospheric chemistry transport model and applying Ordinary Kriging for post-processing.

This method addresses recent biases between model outputs and measurements. A machine learning-based alternative is presented that can leverage a wide array of covariates, enabling it to capture complex non-linear relationships present in the data. This capability translates to improved predictive accuracy, including at locations where the model has not been explicitly trained.
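A hedged sketch of the general post-processing idea (not the authors' implementation): learn a correction that maps the raw chemistry-transport forecast plus auxiliary covariates to the measured concentration, here with a random forest via the ranger package; train and test are hypothetical data frames:

    # Assumed columns per station and hour: obs (measured concentration), ctm (raw
    # chemistry-transport forecast), plus covariates such as wind_speed,
    # temperature and site_type.
    library(ranger)

    fit <- ranger(obs ~ ctm + wind_speed + temperature + site_type,
                  data = train, num.trees = 500)

    # Corrected forecast at held-out locations/times
    corrected <- predict(fit, data = test)$predictions

    # Compare against the uncorrected forecast
    rmse <- function(a, b) sqrt(mean((a - b)^2))
    c(raw = rmse(test$obs, test$ctm), corrected = rmse(test$obs, corrected))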

Rachael Duncan: Tackling environmental data – missing data and big data

This talk will look at two environmentally motivated problems. The first is whether we can make use of modelled data to recover missing observational data. Performing accurate statistical inference requires high-quality datasets.

However, real-world datasets often contain missing values, to varying degrees both spatially and temporally. Modelled datasets, by contrast, provide complete coverage but are often biased. By conceptualising this bias as a skew, we consider a skew Kalman approach to bias-correct modelled surface-level ozone data and use the bias-corrected data to infill missing values in the observed dataset.

The second problem this talk will consider is how well existing methods for big data hold up for spatial data. Spatial models such as Gaussian processes scale cubically with the number of data points and thus quickly become computationally infeasible for moderate to large datasets. Divide-and-conquer methods allow the data to be split into subsets, with inference carried out on each subset before the results are recombined. While well documented in the independent setting, these methods are less popular in the spatial setting. This talk evaluates the performance of divide-and-conquer methods in the spatial setting, using USA temperature data, comparing their approximate results with inference carried out on the full dataset.
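The divide-and-conquer pattern can be sketched in a few lines of base R (an illustration only, not the speaker's method): split the spatial observations into subsets, maximise a Gaussian likelihood on each subset, and recombine by averaging, so that each fit costs O(m^3) for subset size m rather than O(n^3) for the full dataset:

    # Toy divide-and-conquer for spatial covariance parameters; placeholder data.
    set.seed(1)
    n      <- 2000
    coords <- matrix(runif(2 * n), ncol = 2)
    z      <- rnorm(n)                               # placeholder observations

    neg_loglik <- function(logtheta, coords, z) {
      sigma2 <- exp(logtheta[1]); range <- exp(logtheta[2])
      Sigma  <- sigma2 * exp(-as.matrix(dist(coords)) / range) + diag(1e-6, nrow(coords))
      L      <- chol(Sigma)
      alpha  <- backsolve(L, forwardsolve(t(L), z))
      0.5 * (length(z) * log(2 * pi) + 2 * sum(log(diag(L))) + sum(z * alpha))
    }

    k      <- 8                                      # number of subsets
    splits <- split(sample(n), rep_len(1:k, n))      # random partition of the sites
    fits   <- sapply(splits, function(idx)
      optim(c(0, log(0.2)), neg_loglik, coords = coords[idx, ], z = z[idx])$par)

    exp(rowMeans(fits))                              # recombined (sigma2, range) estimates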

Dan Clarkson: Extreme temperatures on the Greenland ice sheet: the challenges of working with environmental data

Rises in global temperatures caused by climate change have contributed to significant melt on the Greenland ice sheet over the past six decades. This can be seen from increasingly high ice surface temperatures and spatial extreme temperature events that cause large areas of the ice sheet to melt simultaneously.

In this talk, we discuss the challenges of working with environmental data and how we can adapt models and methods to produce more representative and useful results. In particular, we focus on identifying melt from data that has a soft upper limit in temperatures around 0°C caused by the melt process of ice, and the difficulties in applying extreme value analysis models to data with poorly defined upper tails. We examine the approach of building up our methods in light of these challenges: progressing from single-site modelling using Gaussian mixture models, to spatial modelling of temperatures using Gaussian processes, to modelling spatial melt events using the Spatial Conditional Extremes model.
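As a toy illustration of the single-site starting point (my sketch, not the authors' model), the snippet below fits a two-component Gaussian mixture to simulated ice surface temperatures with the mclust package and treats the component centred near 0°C as the melt state:

    # Two-component Gaussian mixture for single-site temperatures; simulated
    # placeholder data with a soft cap near 0 degrees C for the melt state.
    library(mclust)

    set.seed(1)
    temps <- c(rnorm(300, mean = -12, sd = 5),           # placeholder frozen-state temperatures
               pmin(rnorm(80, mean = -0.5, sd = 1), 0))  # placeholder melt-state temperatures

    fit    <- Mclust(temps, G = 2)                # two-component univariate mixture
    melt   <- which.max(fit$parameters$mean)      # component with the warmer mean
    p_melt <- fit$z[, melt]                       # posterior melt probability per observation

    mean(p_melt > 0.5)                            # estimated proportion of melt observations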