Statistics PhD Projects
The lists below give potential Statistics PhD projects that members of the department are interested in supervising. They are not prescriptive, but give an idea of what a PhD project in the particular area could cover. If you are interested in a project please either contact the corresponding supervisor to discuss it or mention it in the “Academic Proposal” section of the application form when you apply online.
Bayesian and Computational Statistics
MCMC for reaction networks
Consider a set of species: this could be literal, such as foxes and rabbits, or a protein and its dimer, or could be classes such as people susceptible to a disease, people infected with a disease and people who have recovered from the disease. The different species interact through "reactions" (e.g. a fox eats a rabbit) and the rate of these reactions depends on the current numbers of the relevant species (e.g. the more foxes and rabbits there are, the more rabbits are eaten each day). A reaction network is the continuous-time Markov chain whose state is the current number of each species. Interest may lie in the forms of the reactions and their rates, in the current or historical species numbers, in future prediction, or even all three. An example project might create a new, more efficient inference methodology for a subclass of reaction networks and apply it to an autoregulatory gene network or a disease epidemic.
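The continuous-time Markov chain described above can be simulated exactly with Gillespie's stochastic simulation algorithm. The following is a minimal sketch for a predator-prey ("fox eats rabbit") network; the rate constants and initial numbers are illustrative assumptions, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Lotka-Volterra-style reaction network (illustrative rates):
# prey birth X -> 2X, predation X + Y -> 2Y, predator death Y -> 0
def gillespie(x, y, rates, t_max):
    """Exact simulation of the continuous-time Markov chain."""
    c1, c2, c3 = rates
    t, path = 0.0, [(0.0, x, y)]
    while t < t_max:
        a = np.array([c1 * x, c2 * x * y, c3 * y])  # reaction hazards
        a0 = a.sum()
        if a0 == 0:            # no reaction can fire; chain is absorbed
            break
        t += rng.exponential(1 / a0)   # waiting time to the next reaction
        r = rng.choice(3, p=a / a0)    # which reaction fires
        if r == 0:
            x += 1
        elif r == 1:
            x, y = x - 1, y + 1
        else:
            y -= 1
        path.append((t, x, y))
    return path

path = gillespie(x=50, y=25, rates=(1.0, 0.01, 0.6), t_max=10.0)
```

Inference for such models typically treats the unobserved reaction times and types as missing data, which is what makes MCMC for reaction networks computationally challenging.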
Biostatistics
Infectious disease modelling and inference
Inference methods for coupled within-host and individual-level models of infectious disease spread
Most individual-level models of infectious disease transmission treat the infection state of individuals as a binary variable – either infected or not – and do not use quantitative information from testing data (such as molecular and antibody data) (1). Whilst this simplifies inference of individuals’ infection states, it loses information about heterogeneity in infection levels and infectiousness between individuals, which is important in transmission. Examples of Bayesian inference methods for coupled within- and between-host infectious disease models are rare (2). In part this is due to the complexity of the methods required to perform such multiscale inference, which must account for large amounts of missing data, and in part to the relative lack of contemporaneous within-host and between-host data available until recent years. This project will develop multiscale models of infectious disease transmission and Bayesian inference methods for fitting them to multiscale data, including for the vector-borne disease visceral leishmaniasis and antimicrobial-resistant bacteria.
- Hay JA, Kennedy-Shaffer L, Kanjilal S, Lennon NJ, Gabriel SB, Lipsitch M, et al. Estimating epidemiologic dynamics from cross-sectional viral load distributions. Science [Internet]. 2021 Jun 3 [cited 2023 Oct 30]; Available from: https://www.science.org/doi/10.1126/science.abh0635
- Tsang TK, Perera RAPM, Fang VJ, Wong JY, Shiu EY, So HC, et al. Reconstructing antibody dynamics to estimate the risk of influenza virus infection. Nat Commun. 2022 Mar 23;13(1):1–8.
Leveraging high-resolution contact data for improved infectious disease forecasting
Lloyd Chapman, Chris Jewell, Jon Read (CHICAS)
Data from social contact surveys plays a vital role in forecasting short- and medium-term infectious disease dynamics. However, traditional social contact surveys are limited in the resolution and accuracy of the information they provide about how many contacts individuals have with other individuals of different ages in different settings, due to the number of people that can be surveyed. Furthermore, they tend to provide static estimates of contact rates between different groups in the population, which do not reflect changes in contact patterns that occur during an epidemic. The contact tracing data gathered by the UK Health Security Agency’s Contact Tracing and Advisory Service (CTAS) during the COVID-19 pandemic does not suffer from these issues, as it comprises detailed data on the contacts of approximately 20 million individuals who tested positive for SARS-CoV-2 between April(?) 2020 and February 2022. This project will use this uniquely large and high-resolution dataset to test whether using real-time estimates of contact rates in forecasting models can improve the accuracy of incidence forecasts. This will involve assessing whether it is possible to predict changes in contact patterns in response to changes in restrictions during the COVID-19 pandemic, and extending an existing spatiotemporal short-term disease forecasting framework (1,2). Further work may include incorporating the detailed estimates of contact rates into a Bayesian framework for nowcasting and forecasting COVID-19 incidence at the local area level in the UK (3,4).
- Robert A, Chapman LAC, Grah R, Niehus R, Sandmann F, Prasse B, et al. Predicting subnational incidence of COVID-19 cases and deaths in EU countries [Internet]. medRxiv. 2023 [cited 2023 Oct 25]. p. 2023.08.11.23293400. Available from: https://www.medrxiv.org/content/10.1101/2023.08.11.23293400v1.abstract
- Bekker-Nielsen Dunbar M, Held L. The COVID-19 vaccination campaign in Switzerland and its impact on disease spread [Internet]. medRxiv. 2023 [cited 2023 Oct 25]. p. 2023.04.06.23288251. Available from: https://www.medrxiv.org/content/10.1101/2023.04.06.23288251v1.abstract
- Jewell CP, Hale AC, Rowlingson BS, Suter C, Read JM, Roberts GO. Bayesian inference for high-dimensional discrete-time epidemic models: spatial dynamics of the UK COVID-19 outbreak [Internet]. 2023 [cited 2023 Oct 25]. Available from: http://arxiv.org/abs/2306.07987
- Hale AC, Read JM, Jewell CP. Modelling the impact of social mixing and behaviour on infectious disease transmission: application to SARS-CoV-2 [Internet]. 2022 [cited 2023 Oct 25]. Available from: http://arxiv.org/abs/2211.02371
Inference for contact tracing data in epidemic models
Chris Jewell, Lloyd Chapman, Jon Read (CHICAS)
Epidemic models are routinely used to investigate disease transmission processes, and in particular for prediction and for evaluation of interventions during epidemics. However, in order to be reliable, models must first be trained on existing epidemic data, with key unknown quantities estimated, and the ensuing predictive distributions must be unbiased and fully capture uncertainty. An unsolved question in epidemic modelling is the frequency at which individuals interact with each other, which, if known, would allow us to separate the contribution of human behaviour from the underlying transmissibility of the pathogen.
This project will develop computationally intensive Bayesian methods to infer contact frequency within epidemic models more accurately, using both existing estimates from survey data (e.g. POLYMOD [1]) and also contact tracing data. The project will start by reviewing previous trans-dimensional MCMC methodology which uses contact tracing data to reduce posterior uncertainty and improve the sensitivity and specificity of risk predictions [2].
[1] Mossong J, Hens N, Jit M, Beutels P, Auranen K, et al. (2008) Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases. PLOS Medicine 5(3): e74. https://doi.org/10.1371/journal.pmed.0050074
[2] Jewell C, Roberts GO, (2012) Enhancing Bayesian risk prediction for epidemics using contact tracing. Biostatistics 13(4):567–579. https://doi.org/10.1093/biostatistics/kxs012
Novel inference methodologies for compartmental models of infectious diseases
Eduard Campillo-Funollet, Lloyd Chapman
Compartmental models play a crucial role in the modelling of infectious diseases. To apply these models in practice, we need robust inference schemes that allow us to obtain estimates of the model parameters based on the available data. The goal of this project is to develop novel inference methodologies (1) for Susceptible-Exposed-Infectious-Removed (SEIR) type models, including identifiability analysis and application to global datasets on infectious diseases.
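As a point of reference, a deterministic SEIR model can be written down and solved in a few lines; the parameter values below are illustrative assumptions, not estimates for any particular disease:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative rates: transmission, incubation (1/latent period), recovery
beta, sigma, gamma = 0.5, 1 / 5, 1 / 7
N = 1e6  # population size

def seir(t, u):
    S, E, I, R = u
    new_inf = beta * S * I / N  # force of infection times susceptibles
    return [-new_inf, new_inf - sigma * E, sigma * E - gamma * I, gamma * I]

# Start with 10 infectious individuals and integrate over one year
sol = solve_ivp(seir, (0, 365), [N - 10, 0, 10, 0], rtol=1e-8, atol=1e-8)
S_end, E_end, I_end, R_end = sol.y[:, -1]
```

The inference problem is then to estimate (beta, sigma, gamma) from noisy, partially observed data, which is where the identifiability questions mentioned above arise.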
A successful candidate will need to be familiar with at least one programming language (e.g. R, Python) and with compartmental models for infectious diseases.
- Campillo-Funollet E, Wragg H, Van Yperen J, Duong DL, Madzvamuse A. Reformulating the susceptible-infectious-removed model in terms of the number of detected cases: well-posedness of the observational model. Philos Trans A Math Phys Eng Sci. 2022 Oct 3;380(2233):20210306.
Medical Statistics
Maximum Tolerated Dose in phase I oncology drug dose finding studies
Phase I dose finding studies in oncology aim to find the Maximum Tolerated Dose (MTD) of a therapy. In such a study, cohorts of patients receive escalating doses, and the allocation of doses and the final MTD recommendation are based on observations of toxicities and activity markers. Numerous approaches of varying complexity are available to determine this allocation and the MTD. A recent FDA initiative, “Project Optimus*”, aims to ensure that more informed decisions are made at earlier stages of the oncology drug development process. Including pharmacokinetic (PK) data, that is, data on patients' exposure to the drug, is a useful way to make these decisions more informed; however, the implementation is not straightforward. As well as the timing of the availability of such PK data, the variability of the relationship between dose, exposure and toxicity/activity must be considered, in addition to accounting for multiple cycles of treatment.
*https://www.fda.gov/about-fda/oncology-center-excellence/project-optimus
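One widely used model-based approach to MTD estimation is the continual reassessment method (CRM). The sketch below implements a one-parameter power-model CRM with a discretised prior; the skeleton, target toxicity rate and trial data are all illustrative assumptions:

```python
import numpy as np

# One-parameter power model: P(toxicity at dose d) = skeleton[d] ** a, a > 0.
# Skeleton and target toxicity rate are illustrative.
skeleton = np.array([0.05, 0.12, 0.25, 0.40, 0.55])
target = 0.25
a_grid = np.exp(np.linspace(-2, 2, 401))    # grid over the support of a
prior = np.exp(-0.5 * np.log(a_grid) ** 2)  # standard normal prior on log(a)
prior /= prior.sum()

def posterior_mtd(doses, tox):
    """Posterior-mean toxicity curve and recommended dose given trial data."""
    like = np.ones_like(a_grid)
    for d, y in zip(doses, tox):
        p = skeleton[d] ** a_grid
        like *= p ** y * (1 - p) ** (1 - y)   # Bernoulli likelihood per patient
    post = prior * like
    post /= post.sum()
    p_hat = (skeleton[:, None] ** a_grid * post).sum(axis=1)  # E[p_d | data]
    return p_hat, int(np.argmin(np.abs(p_hat - target)))

# Two cohorts of three: no toxicities at dose 0, one toxicity at dose 1
p_hat, rec = posterior_mtd(doses=[0, 0, 0, 1, 1, 1], tox=[0, 0, 0, 0, 1, 0])
```

Extending such models to incorporate PK exposure data and multiple treatment cycles is exactly where the methodological difficulty of the project lies.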
Adaptive enrichment designs using machine learning techniques for subgroup identification
Modern drug development is increasingly focused on identifying the most promising subpopulation to respond to a treatment, rather than taking a ‘one-size-fits-all’ approach. Adaptive enrichment designs have been developed for this purpose, but currently focus mainly on predefined subgroups or regression-based methods. This project will investigate the application of machine learning classification techniques (e.g. support vector machines, random forests, neural networks) to aid the identification of the promising subpopulation, while still ensuring family-wise error control.
Event history analysis of genetic family data
Studies into rare genetic disorders often involve first identifying carriers of gene mutations who typically display symptoms (referred to as probands) and then screening relatives to find further carriers. Retrospective medical history can then be used to try to understand the natural history of the disorder. However, the sampling method greatly complicates statistical inference. Individual families may have a higher or lower propensity to develop diseases or symptoms, but those with more affected members are more likely to be sampled. The retrospective nature also induces a form of left-truncation, since a person can typically only be tested for gene mutations if they are still alive. Motivated by studies of patients with rare-tumour risk syndromes, this PhD project will develop methods for fitting event history analysis models, such as multi-state models, to genetic family data with known or unknown ascertainment methods.
Environmental and Ecological Statistics
Forecasting the ecological impacts of climate change
Emma Eastoe, Rachel McCrea, Susan Jarvis (UK Centre for Ecology and Hydrology)
Over the last two decades, the quantity and quality of climate data has increased massively. Data come from many sources, including in-situ measurements, remote sensing (satellites) and climate model output. Such data can be used to quantify changes to the Earth's climate, and to predict how these changes might play out over the decades to come. Whilst identification of climate trends is undoubtedly useful in and of itself, it is even more vital that these data can be used to help understand, and mitigate, the consequences of climate change. Currently, climate variables are treated as covariates in ecological models [Davies et al, 2023; Ferguson et al, 2008; Jacobson et al, 2004]. There are two major limitations to this approach: (1) uncertainty in the climate variables is not accounted for when fitting the model or making predictions, and (2) the resulting models do not provide trustworthy predictions for future climate scenarios. Uncertainty in the climate variables is especially important when using climate model output, where uncertainty arises from, amongst other things, measurement error, bias and imperfect replication of real-world systems. Treating such output as exactly observed covariates will lead to over-precise predictions. The objective of this project is to develop a joint statistical model for climate and ecological data that addresses these limitations. Major challenges will include selecting and merging appropriate data sets measured at different spatio-temporal resolutions and to varying degrees of accuracy, handling missing values, and downscaling climate model output.
References:
Davies, S.C., Thompson, P.L., Gomez, C., Nephin, J., Knudby, A., Park, A.E., Friesen, S.K., Pollock, L.J., Rubidge, E.M., Anderson, S.C. and Iacarella, J.C., "Addressing uncertainty when projecting marine species' distributions under climate change". Ecography, (2023) p.e06731.
Ferguson, C. A., L. Carvalho, E. M. Scott, A. W. Bowman, and A. Kirika. "Assessing ecological responses to environmental change using statistical models." Journal of Applied Ecology 45, no. 1 (2008): 193-203.
Jacobson, A.R., Provenzale, A., von Hardenberg, A., Bassano, B. and Festa-Bianchet, M. “Climate forcing and density dependence in a Mountain Ungulate Population.” Ecology 85, no. 6 (2004): 1598-1610
Integrated modelling of random walk data
Sheep preferences for different types of grass can lead to non-homogeneous grazing patterns. Understanding these grazing patterns is of critical importance for conservation, but the available data are often limited to GPS tracks from a few sheep and records from camera traps. This poses a statistical challenge: how can we integrate both datasets in one model to infer the population density at different locations?
The goal of this project is to create new inference methodologies for time-dependent population densities, based on track data (realisations of a random walk) and camera trap data (passage times at fixed locations).
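A toy version of the two data sources might look as follows: a Brownian-motion track plays the role of the GPS data, and crossings of a fixed location play the role of camera-trap detections (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a 1-D Brownian-motion track (a GPS-like trajectory)
dt, n, sigma = 0.1, 5000, 1.0
steps = rng.normal(0.0, sigma * np.sqrt(dt), size=n)
track = np.concatenate([[0.0], np.cumsum(steps)])
times = np.arange(n + 1) * dt

# "Camera trap" detections: times at which the walker crosses a fixed location
camera = 2.0  # camera-trap position (illustrative)
crossed = np.diff(np.sign(track - camera)) != 0
passage_times = times[1:][crossed]
```

The statistical task is then to combine the two likelihood contributions (the full track and the passage times) in a single model of a time-varying population density.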
A successful candidate will need to be familiar with at least one programming language (e.g. R, Python) and random walks (e.g. Brownian motion).
Extreme Value Statistics
Geometric approaches to modelling multivariate extremes
Modelling extremes of multiple random variables is an intricate task, as the dependence assumptions on the data will strongly influence extrapolations from our models. This can change the estimated probability of certain extreme events occurring by orders of magnitude, which has a clear impact on risk assessment. There are different approaches to modelling multivariate extremes, but modelling in moderate-high dimensions, while allowing for realistic dependence structures, still represents a challenge. A new framework, based on a so-called geometric representation of multivariate extremes, appears promising for opening up higher dimensional analysis. This PhD project will develop novel methodology to help realise the potential of this exciting new approach.
Structured statistical modelling approaches to flood prediction
Emma Eastoe, Israel Martinez Hernandez
Each year flooding in the UK causes disruption to local communities and the economy. Protection of people, homes, businesses and infrastructure in locations vulnerable to flooding requires accurate predictions of flood risk. For many years, the industry standard was to fit a statistical model to historical river flow measurements at each location, with predictions based on an extrapolation from this model. The models used tended to be overly simplistic and unable to capture important process features such as inter-year variability, long-term trends, impacts of land-use change and spatio-temporal dependence. Using data from the UK National River Flow Archive, this project will investigate ways to improve flood risk predictions by modelling within- and between-event temporal dependence. The objectives of this project are to develop:
- Statistical predictions of flood event profiles by combining functional data analysis with extreme value theory.
- Statistical models to describe the clustering of flood events, such as those seen in the North of the UK as a consequence of Storms Desmond, Eva and Frank in December 2015/January 2016 and the recent Storm Babet (October 2023).
- A multivariate approach to capture the joint risk of fluvial (river) and coastal flooding for vulnerable regions in the UK.
The project will suit anyone with a Master's-level understanding of statistical modelling, including generalised linear models, mixed effects models and either time series analysis or geostatistics. You should be confident with likelihood or Bayesian inference, and with undergraduate-level multivariate probability.
References
Eastoe, E. (2019). Nonstationarity in peaks‐over‐threshold river flows: A regional random effects model. Environmetrics, 30(5), e2560.
Heffernan, J. E., & Tawn, J. A. (2004). A conditional approach for multivariate extreme values (with discussion). Journal of the Royal Statistical Society Series B: Statistical Methodology, 66(3), 497-546.
Keef, C., Tawn, J., & Svensson, C. (2009). Spatial risk assessment for extreme river flows. Journal of the Royal Statistical Society Series C: Applied Statistics, 58(5), 601-618.
Keef, C., Tawn, J. A., & Lamb, R. (2013). Estimating the probability of widespread flood events. Environmetrics, 24(1), 13-21.
Martinez-Hernandez, I., & Genton, M. (2023). Surface time series models for large spatio-temporal datasets. Spatial Statistics, 53.
Martinez-Hernandez, I. & Genton, M. (2021). Nonparametric trend estimation in functional time series with application to annual mortality rates. Biometrics, 77(3).
Winter, H. C., & Tawn, J. A. (2017). kth-order Markov extremal models for assessing heatwave risks. Extremes, 20, 393-415.
Social Statistics
Latent variable models for social science data
The social and behavioural sciences take a great interest in variables that are inherently difficult to measure well. These may, for example, relate to the mental health status of patients under treatment, the proficiency levels of students, or the political attitudes of voters. Psychological questionnaires, educational tests and social surveys are examples of instruments that are often constructed with the goal of learning about an underlying, latent phenomenon. With the increasing heterogeneity and size of social science data, there is a great need for more flexible, robust and computationally efficient statistical methods to ensure the validity and fairness of educational and psychological measurement. Students interested in developing new methodology and theory for latent variable models are welcome; the aim is to model the underlying response processes more closely, to improve the measurement of the latent attributes and to gain further insight into those processes.
Statistical Learning
Dimension Reduction and Regularisation for Spectral Clustering
The group of methods referred to broadly as “Spectral Clustering” (SC) has become one of the most influential classes of data clustering methods in recent years. Spectral clustering relies on the spectral decomposition of normalised data similarity matrices to approximate some NP-hard graph clustering problems. The popularity of SC methods is due to their flexibility and broad applicability, as well as some interesting theoretical connections to non-parametric clustering in the statistical context. As with all highly flexible models, however, they are inherently sensitive to poor hyperparameter tuning and can lack interpretability. Interpretability of clustering models is increasingly important, as “explainability” and “fairness” have become fixtures in the statistical learning vocabulary. This project will develop new methodology for regularised spectral clustering, in which regularisation is achieved through purpose-specific penalties on the eigenvalue objective which is maximised in standard SC problems. Important examples of such purposes include (i) balance across categories (fairness); (ii) sparse dimension reduction (interpretability); and (iii) merging of different data types (multi-view clustering).
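As a minimal illustration of the standard SC pipeline that such penalties would modify, the following clusters two well-separated Gaussian blobs via the symmetric normalised Laplacian (toy data, no hyperparameter tuning):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated Gaussian blobs (toy data)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])

# Gaussian similarity matrix
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))

# Symmetric normalised Laplacian: L = I - D^{-1/2} W D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))
L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Embed via the eigenvectors of the smallest eigenvalues, then split on the
# sign of the second ("Fiedler") eigenvector
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
```

Regularised variants replace this plain eigenvalue objective with a penalised one, for instance adding a fairness or sparsity penalty before solving the eigenproblem.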
Active drifter deployment
Drifters are free-floating measurement devices that are released into the ocean. They move with the currents, and provide an alternative source of data to fixed measurement devices (e.g. anchored buoys). This project is not about techniques to analyse such data; it is about deciding where and when to deploy drifters to get the most useful information. You will devise techniques to design drifter campaigns by combining approaches such as Gaussian process emulation, Bayesian optimisation and experimental design. You will work with ocean simulators (supplied by collaborators) to develop your techniques, and will have the opportunity to collaborate directly with potential users of your methods. The project will build on fundamentals of Bayesian statistics, spatial statistics, experimental design, numerical optimisation, high performance computing and information theory; successful candidates should already be familiar with at least some of these areas.
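A simple starting point for such design problems is maximum-variance (uncertainty-sampling) design under a Gaussian process: deploy the next drifter where the GP posterior variance is largest. The 1-D sketch below uses an RBF kernel and made-up observation locations:

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

grid = np.linspace(0, 5, 200)   # candidate deployment locations
x_obs = np.array([1.0, 3.5])    # locations already observed (illustrative)
noise = 1e-4

K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
k_star = rbf(grid, x_obs)
# GP posterior variance at each candidate point (prior variance 1)
var = 1.0 - np.einsum('ij,jk,ik->i', k_star, np.linalg.inv(K), k_star)
next_x = grid[np.argmax(var)]   # deploy where we are most uncertain
```

In the project this would be replaced by decision-theoretic criteria driven by full ocean-simulator output rather than a 1-D toy field.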
Bandits in real systems
Multi-armed bandit theory is extremely well studied in situations where there is a very direct link between actions and rewards. However, in many situations where we may wish to deploy these techniques, the choice of an action leads to outcomes in a complex and partially understood way. For example, choosing the price of a finitely-available product for the following day will result in a semi-predictable sales pattern, and a consequent amount of stock left at the end of the day. Similarly, choosing some hyperparameters of a learning method for a period of time will result in a semi-predictable performance improvement of the method. This project will develop techniques for such problems, where there is a (semi-)parameterised model of the world, and sequential decisions must be taken to simultaneously learn the model and optimise outcomes.
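In the classical setting with a direct action-reward link, a complete bandit algorithm fits in a few lines; the sketch below is Thompson sampling for a Bernoulli bandit with made-up arm probabilities:

```python
import numpy as np

rng = np.random.default_rng(4)

# Bernoulli bandit with Thompson sampling: Beta posterior per arm
true_p = [0.3, 0.5, 0.7]   # unknown to the learner (illustrative)
alpha = np.ones(3)         # Beta(1, 1) priors
beta = np.ones(3)

pulls = np.zeros(3, dtype=int)
for _ in range(2000):
    theta = rng.beta(alpha, beta)   # one posterior draw per arm
    arm = int(np.argmax(theta))     # play the arm with the largest draw
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward            # conjugate Beta-Bernoulli update
    beta[arm] += 1 - reward
    pulls[arm] += 1
```

The difficulty this project addresses is that, in real systems, the reward is a semi-predictable function of the action mediated by a partially understood model of the world, so the simple conjugate update above no longer applies directly.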
Bandit Algorithms with Non-trivial Inference Mechanisms
Multi-armed bandit models are the most fundamental models within reinforcement learning: they capture the challenge of learning to choose the optimal action among a set of actions through repeated experimentation under uncertainty. Solutions to these problems combine statistical inference, optimisation and insights from probability theory. There is a rich theory surrounding optimal algorithms for the problem when the statistical inference task is relatively straightforward (e.g. estimating independent parameters, or the effects in a generalised linear model), but there remains uncertainty around how to mesh this theory with more challenging estimation procedures, e.g. where inference is necessarily numerical and approximate. A student with a strong interest in statistical theory, as well as a willingness to explore machine learning and optimisation, would be a good fit for this project.
Time Series
Time series models for spatio-temporal data sets
When data correspond to a specific location (space) and period (time), it is essential to include the effect of these components in the statistical model. Accounting for these two components allows us to address many relevant scientific questions: for instance, predicting pollutant levels in areas of a city without a monitoring station, or understanding how past pollution levels will influence pollution levels in a different region of the city. Although we can model the effects of time and space separately, the resulting model will be a simplified version of a more complex system, which is rarely realistic. Additionally, predictions can be made more accurate by modelling both components jointly. This project proposes new models that incorporate a more substantial spatial component into time series models.
Implementation and Methodologies for large and complex Datasets
Due to the rapid development of technology, data can now be collected on a large scale, resulting in high-dimensional and high-frequency data that sometimes necessitate high-performance computing, which is often a limitation for practitioners. This is pushing forward the development of new statistical models. In the context of large datasets, it is natural to assume that data will have complex (spatial and/or temporal) dependencies and long-term trends, and classical statistical models are not enough to describe these complex structures. A new paradigm of data analysis assumes that observations are complex objects, such as continuous functions or surfaces, instead of numbers. This approach is known as functional data analysis. Examples of this type of data include blood flow measurements, flood profiles, sequences of satellite images and handwriting.
This project is about developing new statistical models for large and complex datasets using a functional data approach. It will focus on nonstationary data with temporal and/or spatial dependence. Some specific topics are:
- Functional factor models with applications to environmental data or neuroscience
- Functional time series models for a sequence of complex curves with application to hurricane trajectories.
Any student with a Master's-level background in statistics is welcome to apply.
References:
- Martinez-Hernandez, I., & Genton, M. (2023). Surface time series models for large spatio-temporal datasets. Spatial Statistics, 53.
- Martinez-Hernandez, I., & Genton, M. (2020). Recent developments in complex and spatially correlated functional data. Brazilian Journal of Probability and Statistics, 34(2).
- Martinez-Hernandez, I., Gonzalo, J., & González-Farías, G. (2022). Nonparametric estimation of functional dynamic factor model. Journal of Nonparametric Statistics, 34(4).
- Methods and algorithms for optimal representation of continuous curves (1D and 2D) with B-splines.
Computing robust estimators based on ranks
Background: Consider a linear regression model, for simplicity. A major branch of nonparametric statistics deals with the rank-estimation (R-estimation) of regression parameters by minimising certain deviations based on the ranks of the residuals in the model. Although there has been extensive theoretical development of R-estimators in linear models over the last six decades or more, their computation remains a challenging and long-standing problem that restricts their practical application. We have proposed iterative algorithms that can be applied routinely to compute R-estimators based on any score function in linear regression and autoregressive models. Because of this computational simplicity, various avenues have opened up to use R-estimators as one of the most competitive robust classes of estimators in various statistical models.
Intended activity: We would like to investigate the convergence properties of algorithms for computing R-estimators. We will also study the usefulness and applications of R-estimators in various linear and nonlinear models through simulations and data analysis. Although R-estimators are asymptotically normal, initial simulation studies have revealed that their finite-sample distributions can be asymmetric. It would therefore be natural to explore the effectiveness of bootstrap procedures for approximating the finite-sample distributions of various R-estimators.
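To fix ideas, a minimal computation of a Wilcoxon-score R-estimator of a regression slope minimises Jaeckel's rank-based dispersion; the data below are simulated with heavy-tailed errors, and all settings are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Simulated simple linear model with heavy-tailed (t_2) errors
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=n)  # true slope = 2

def jaeckel_dispersion(b):
    """Jaeckel's rank-based dispersion with centred Wilcoxon scores."""
    e = y - b * x
    r = e.argsort().argsort() + 1             # ranks of the residuals
    a = np.sqrt(12) * (r / (n + 1) - 0.5)     # centred Wilcoxon scores
    return np.sum(a * e)

fit = minimize(jaeckel_dispersion, x0=0.0, method='Nelder-Mead')
slope_r = fit.x[0]
intercept_r = np.median(y - slope_r * x)      # intercept via median residual
```

The intercept is not identified by the dispersion itself (the centred scores sum to zero), so it is estimated separately here via the median residual; studying the convergence of more sophisticated iterative schemes for this minimisation is part of the intended work.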