Assessing the effect of air pollution on respiratory health using drug prescriptions
In this work we study the association between exposure to outdoor pollution and drug prescriptions for respiratory diseases in England. Drug prescribing data are obtained through the English National Health Service for the period from August 2010 to December 2012 for all general practices in England. The dataset includes around 11,000 practices and 30,000 drugs, with about 112 million drug items prescribed each month. Moreover, the following additional time-invariant variables are available at the practice level: address, postcode, number of patients registered, patient age distribution, asthma prevalence, pulmonary disease prevalence and multiple deprivation index.
Air pollutant concentration data are available from two sources: monthly measurements (same period as drug prescriptions) are obtained from the European Air Quality Database (AIRBASE), while mean concentration for 2009 is available through the ADMS numerical model for lower super output areas in England. The two data sources are integrated through a space-time model using SPDE to predict concentration values at the lower super output area level for each month. Temperature is also included in the model as an additional covariate.
Then the predicted concentration is linked to the drug prescriptions in a spatio-temporal hierarchical framework using the INLA approach.
Joint work with Michela Cameletti
Some case studies and challenges with spatial data
We discuss three problems with spatial data, all collected in Paraná State, Brazil. Two of them are related to public health issues, with counts of infant mortality and leprosy recorded by municipality within the state. Results of preliminary analyses are shown, aiming to identify high-risk regions, spatial patterns and possible environmental and socio-economic factors. The analyses include a proposal to build covariate-based neighborhood matrices. The third problem aims to identify possible influence effects and gradients from a highly irregular areal source on a measured variable across the study region.
Placating pugilistic pachyderms: proper priors prevent poor performance
Modern statistical models are hard. Jimmie Savage suggested that we “build our models as big as elephants”, while J. Bertrand told us to “give [him] four parameters and [he] shall describe an elephant; with five, it will wave its trunk”. The modern practice of Bayesian statistics can be seen as a battle between these two elephants. In this talk, I will outline the concept of Penalised Complexity (PC) priors, which are our attempt to turn this into a fair fight.
Turning it up to eleven
The aim of the EUSTACE project is to compute global reconstructions of daily temperatures since 1850, on a quarter degree spatial grid. The associated latent spatio-temporal models have, in principle, one million spatial and 60 thousand temporal degrees of freedom. Each data source has its own observation model, sometimes with persistent random effects that make sequential analysis difficult. I will discuss some of the challenges in scaling the calculations for multiscale latent stochastic PDE models in the resulting non-linear least squares problem, with emphasis on how to extract uncertainty information.
Time series modeling of pathogen-specific disease probabilities with incomplete data
In this talk I will discuss avoiding the need for an auxiliary variable MCMC scheme via the use of an approximate likelihood. An empirical Bayes scheme is used to impute missing data, and a relative risk, with associated standard error, is constructed. This relative risk is then modeled using a semi-parametric model, as a function of meteorological covariates, with confounding by time modeled using a second-order random walk (RW2) model. This is joint work with Leigh Fisher.
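As an illustration of the RW2 component mentioned above, the following is a minimal sketch of the structure matrix of a second-order random walk, built from second differences. The function name `rw2_structure` is hypothetical and the code is not taken from the talk; it only shows why an RW2 prior leaves constant and linear trends unpenalised.

```python
import numpy as np


def rw2_structure(n):
    """Structure matrix R = D'D of an RW2 model on n equally spaced time points.

    D is the (n - 2) x n second-difference matrix, so x' R x is the sum of
    squared second differences of x. (Hypothetical helper, for illustration.)
    """
    D = np.diff(np.eye(n), n=2, axis=0)  # rows: x[i] - 2 x[i+1] + x[i+2]
    return D.T @ D


R = rw2_structure(6)
# Constant and linear sequences have zero second differences,
# so they lie in the null space of R (rank deficiency 2).
print(np.allclose(R @ np.ones(6), 0), np.allclose(R @ np.arange(6.0), 0))
```

The precision matrix of the RW2 prior is this structure matrix multiplied by a precision parameter; the rank deficiency of 2 is what makes the prior improper and is usually handled with a sum-to-zero constraint.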
Mark van de Wiel:
Better prediction by use of co-data: Adaptive group-regularized ridge regression
For high-dimensional settings, we show how one can use empirical Bayes principles to estimate penalties that may differ across groups of variables. These groups are predefined using co-data, which is auxiliary information available on the variables (e.g. genomic annotation or external p-values). Due to the adaptive character of the penalties, the group-wise penalties may improve predictions when the groups are indeed informative, while not deteriorating them when this is not the case. We provide an implementation in a classical logistic ridge regression setting; however, I will also discuss an extension of the framework to a Bayesian ridge regression setting using INLA. The latter is particularly useful for obtaining credibility intervals on the predicted event probabilities. I will discuss some preliminary results which show that highest-probability density intervals seem to have fairly good coverage when the number of variables is not extremely large. I will illustrate results with simulations and with some examples on classification using cancer genomics data.
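To make the idea of group-wise penalties concrete, here is a minimal sketch of ridge regression in which each variable is penalised according to its co-data group. It is a simplification under stated assumptions: it uses a linear rather than logistic response, the penalties are supplied by hand rather than estimated by empirical Bayes, and the function name `group_ridge` is hypothetical.

```python
import numpy as np


def group_ridge(X, y, groups, lambdas):
    """Ridge estimate with one penalty per group of variables.

    groups[j] is the integer group label of column j of X, and
    lambdas[g] is the penalty applied to every variable in group g.
    Solves (X'X + diag(lambda_{g(j)})) beta = X'y.
    (Illustrative sketch only; penalties are assumed given.)
    """
    groups = np.asarray(groups)
    penalty = np.asarray(lambdas, dtype=float)[groups]  # one penalty per column
    A = X.T @ X + np.diag(penalty)
    return np.linalg.solve(A, X.T @ y)


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)
# Group 0 (informative variables) gets a light penalty, group 1 a heavy one.
beta_hat = group_ridge(X, y, groups=[0, 0, 0, 1, 1, 1], lambdas=[0.1, 100.0])
print(beta_hat.round(2))
```

With equal penalties across groups this reduces to ordinary ridge regression, which is the sense in which uninformative groups need not hurt predictions.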
Multivariate Distributional Regression
Distributional regression provides a unified framework for structured additive regression with responses from a variety of continuous, discrete and latent multivariate response distributions, where each parameter of these potentially complex distributions is related to a structured additive predictor. The latter is an additive composition of different types of covariate effects, e.g. nonlinear effects of continuous covariates, random effects, spatial effects, or interaction effects. Inference is realised by a generic, computationally efficient Markov chain Monte Carlo algorithm based on iteratively weighted least squares approximations and with multivariate Gaussian priors to enforce specific properties of functional effects.
Additive Mixed Models for Generalized Functional Data
We propose and evaluate an extensive framework for additive regression models for correlated functional responses from exponential families whose conditional expectation varies smoothly over the functions' arguments. Our proposal allows for multiple partially nested or crossed functional random effects with flexible correlation structures as well as linear and nonlinear effects of functional and scalar covariates that may vary smoothly over the argument of the functional response. It accommodates densely, sparsely or irregularly observed functional responses and predictors, which may be observed with additional error, and includes both spline-based and functional principal component-based terms. Estimation and inference in this framework are based on standard generalized additive mixed models, allowing us to take advantage of established methods and robust, flexible algorithms.
Adaptive prior weighting in generalized regression
The prior distribution is a key ingredient in Bayesian inference. Prior information on regression coefficients may come from different sources and may or may not be in conflict with the observed data. Various methods have been proposed to quantify a potential prior-data conflict, such as Box's $p$-value. However, the literature offers little guidance on what to do if the prior is not compatible with the observed data. To this end, we review and extend methods to adaptively weight the prior distribution. We relate empirical Bayes estimates of prior weight to Box's $p$-value and propose alternative fully Bayesian approaches. Prior weighting can be done for the joint prior distribution of the regression coefficients or, under prior independence, separately for each regression coefficient or for pre-specified blocks of regression coefficients. We outline how the proposed methodology can be implemented using integrated nested Laplace approximations (INLA) and illustrate the applicability with a Bayesian logistic regression model for data from a cross-sectional study and a Bayesian analysis of binary longitudinal data from a randomized clinical trial using a generalized linear mixed model.
Frequentist inference with Gaussian Markov Random Fields
Consider a GMRF $U$ with precision matrix $Q$, and $Y = X\beta + U + Z$ with $Z$ being independent Gaussian noise. Bayesian inference is an attractive approach for this model as the full conditional distributions of each $U_i$ can be derived easily. Frequentist inference requires a likelihood function based on the marginal distribution of $Y$, integrating out $U$, which does not share the same sparseness properties as the conditional distributions. Judicious use of matrix inversion identities and efficient coding with a sparse matrix library yield a fast and memory-sparing algorithm even when the dimensionality is large. The talk will demonstrate the algorithm in conjunction with the GMRF approximation of the Matérn covariance for gridded data, assessing the accuracy of the GMRF approximation, the effect of edge corrections, and the impact of missing observations.
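One way such matrix identities can keep the computation sparse is sketched below. It is not the speaker's algorithm: it assumes the noise $Z$ has variance $\sigma^2$ (notation introduced here), uses the Woodbury identity $\Sigma^{-1} = \sigma^{-2} I - \sigma^{-4} (Q + \sigma^{-2} I)^{-1}$ for $\Sigma = Q^{-1} + \sigma^2 I$, and relies on SciPy's sparse LU factorisation rather than a dedicated sparse Cholesky. The function names are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu


def sparse_logdet(A):
    """log|det A| of a sparse matrix from its LU factors (L has unit diagonal)."""
    lu = splu(A.tocsc())
    return float(np.sum(np.log(np.abs(lu.U.diagonal()))))


def marginal_loglik(y, X, beta, Q, sigma2):
    """Gaussian log-likelihood of y with U integrated out.

    Model: y = X beta + U + Z, U ~ N(0, Q^{-1}), Z ~ N(0, sigma2 * I).
    Only the sparse matrices Q and P = Q + I / sigma2 are factorised;
    the dense covariance Sigma = Q^{-1} + sigma2 * I is never formed.
    """
    n = y.shape[0]
    r = y - X @ beta
    P = (Q + sp.identity(n, format="csc") / sigma2).tocsc()
    lu = splu(P)
    logdet_P = float(np.sum(np.log(np.abs(lu.U.diagonal()))))
    # |Sigma| = sigma2^n * |P| / |Q|
    logdet_Sigma = n * np.log(sigma2) + logdet_P - sparse_logdet(Q)
    # r' Sigma^{-1} r via the Woodbury identity
    quad = r @ r / sigma2 - r @ lu.solve(r) / sigma2**2
    return -0.5 * (n * np.log(2 * np.pi) + logdet_Sigma + quad)


# Small usage example with a tridiagonal (hence sparse) precision matrix.
n = 200
Q = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)
print(marginal_loglik(y, X, np.array([0.5, -1.0]), Q, sigma2=0.7))
```

Both the quadratic form and the log-determinant are expressed through $Q$ and $Q + \sigma^{-2} I$, which have the same sparsity pattern, so memory use scales with the number of non-zeros rather than with $n^2$.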
Recent Advances in Sparse Bayesian Factor Analysis
Factor analysis is a popular method to obtain a sparse representation of the covariance matrix of multivariate observations. The present talk reviews some recent research in the area of sparse Bayesian factor analysis that tries to achieve additional sparsity in a factor model through the use of point mass mixture priors. Identifiability issues that arise from introducing zeros in the factor loading matrix are discussed in detail. Common prior choices in Bayesian factor analysis are reviewed and MCMC estimation is briefly outlined. Applications to a small psychological data set and to financial exchange rate data serve as illustrations (joint work with Hedibert Lopes).
Spatial point processes in the modern world – the need for an interdisciplinary dialogue
In the past, complex statistical methods beyond those covered in standard statistics textbooks would be developed as well as applied by a statistician. Nowadays, freely available, sophisticated software packages such as R are in common use and at the same time increasing amounts of data are collected. As a result, users have both a stronger need to analyse these data themselves and an increasing awareness of the existence of advanced methodology, since it is no longer “hidden” from them in inaccessible statistical journals. Consequently, statisticians need to make their methodology usable.
In this talk, we argue that it is necessary to make methods usable and that, for this to be successful, there needs to be a strong interaction with the user community through interdisciplinary work. This implies not only making model fitting feasible by developing computationally efficient methodology to reduce running times, but also improving the practicality of other aspects of the statistical analysis such as model construction, prior choice and interpretation, as these are equally relevant for users with real data sets and real scientific questions. We discuss the importance of an intense interdisciplinary dialogue for statistics to become relevant in the real world, illustrating it with past and current examples of this ongoing dialogue in the context of spatial point processes and their application, mainly in ecological research.
Joint work with David FRP Burslem