ABC in Paris 

Université Paris Dauphine,

June 26, 2009

We are organising a one-day meeting on recent advances in ABC methods in Paris, at Université Paris Dauphine, on June 26, 2009, as the final step of our ANR 2005-2008 Misgepop project, with financial support from Université Paris Dauphine (BQR) and from the GIS "Sciences de la Décision" X-HEC-ENSAE. There have been so many advances in this area in the past year or so that a single day is obviously too short to cover the whole field, but it should nonetheless highlight those advances and bring the (local) communities together. Further and longer meetings may also stem from this one.

The programme of the workshop consists of the following talks:

Approximate Bayesian computation (ABC) is a popular approach to inference problems where the likelihood function is intractable, or expensive to calculate. To improve over Markov chain Monte Carlo (MCMC) implementations of ABC, the use of sequential Monte Carlo (SMC) methods has recently been suggested. Effective SMC algorithms currently available for ABC have a computational complexity that is quadratic in the number of Monte Carlo samples and require a careful choice of simulation parameters. In this talk, an adaptive SMC algorithm is proposed that admits a computational complexity linear in the number of samples and determines the simulation parameters on the fly. We demonstrate our algorithm on both a toy example and a population genetics example.
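For readers unfamiliar with the basic mechanism that MCMC and SMC versions of ABC build on, here is a minimal sketch of plain ABC rejection sampling on a toy problem (this is a generic illustration, not the adaptive SMC algorithm of the talk; all names and settings here are invented for the example):

```python
import random

def abc_rejection(observed_summary, prior_sampler, simulator, summary,
                  tolerance, n_accept):
    """Basic ABC rejection: keep parameter draws whose simulated summary
    statistic falls within `tolerance` of the observed one."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sampler()            # draw a parameter from the prior
        s = summary(simulator(theta))      # simulate data, then summarise it
        if abs(s - observed_summary) < tolerance:
            accepted.append(theta)
    return accepted

# Toy example: infer the mean of a N(theta, 1) sample of size 50,
# with a flat prior on [-5, 5] and the sample mean as summary statistic.
random.seed(1)
true_theta = 2.0
data = [random.gauss(true_theta, 1.0) for _ in range(50)]
obs = sum(data) / len(data)

posterior = abc_rejection(
    observed_summary=obs,
    prior_sampler=lambda: random.uniform(-5, 5),
    simulator=lambda t: [random.gauss(t, 1.0) for _ in range(50)],
    summary=lambda xs: sum(xs) / len(xs),
    tolerance=0.1,
    n_accept=200,
)
print(sum(posterior) / len(posterior))  # posterior mean, near the true value 2.0
```

The quadratic cost and parameter-tuning issues mentioned in the abstract arise once this rejection step is embedded in an SMC sampler with a decreasing sequence of tolerances.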
Approximate Bayesian Computation (ABC) methods can be used in situations where the evaluation of the likelihood is computationally prohibitive. They are thus ideally suited for the analysis of complex dynamical systems (Toni et al. 2009), where knowledge of the full (approximate) posterior is often essential. Here we discuss improvements to an ABC approach based on sequential Monte Carlo (SMC). We are particularly interested in applying ABC SMC to the increasingly important model selection problem. We will discuss how ABC SMC can be adapted for model selection for dynamical systems given a set of candidate models, and in particular how we can balance the "fit" to the data with the complexity of the simulation model. Being based on repeated simulation, ABC SMC is computationally expensive for models with many parameters (such as those considered in systems biology). We present an exploration of different perturbation kernels, which can improve computational efficiency when exploring high-dimensional parameter spaces while still maintaining the particle diversity needed to obtain good approximations to the posterior distribution.
In this talk we will discuss two representations of the target distribution within an ABC context (relating to a marginal and a joint space representation of the target distribution). We will also discuss the bias arguments related to the paper by Sisson et al. (2007). We will establish a set of unbiased ABC-SMC based algorithms, and finally provide an application.
A key innovation in ABC was the use of a post-sampling regression adjustment, allowing larger tolerance values and thus shifting computation time to realistic orders of magnitude (Beaumont et al. 2002). In my talk I propose a reformulation of the regression adjustment in terms of a General Linear Model (GLM). This allows a natural integration into the theoretical framework of Bayesian statistics and the use of its methods, including model selection via Bayes factors. As an illustration, the proposed methodology is applied to the question of population subdivision among western chimpanzees.
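To make the idea of a regression adjustment concrete, here is a deliberately minimal sketch in the spirit of the Beaumont et al. adjustment: accepted draws are shifted along a fitted regression of the parameter on the summary statistic. (The published method uses weighted local-linear regression; the plain least-squares version below, with invented example values, is only an illustration.)

```python
def regression_adjust(thetas, summaries, s_obs):
    """Shift each accepted draw theta_i to theta_i - b*(s_i - s_obs),
    where b is the least-squares slope of theta on the summary.
    This corrects for the discrepancy between simulated and observed
    summaries that a nonzero tolerance lets through."""
    n = len(thetas)
    ms = sum(summaries) / n
    mt = sum(thetas) / n
    cov = sum((s - ms) * (t - mt) for s, t in zip(summaries, thetas))
    var = sum((s - ms) ** 2 for s in summaries)
    b = cov / var
    return [t - b * (s - s_obs) for t, s in zip(thetas, summaries)]

# Illustrative case where theta depends exactly linearly on the summary:
# after adjustment, every draw collapses onto the value predicted at s_obs.
thetas = [1.0, 2.0, 3.0, 4.0]
summaries = [1.1, 2.1, 3.1, 4.1]
adjusted = regression_adjust(thetas, summaries, s_obs=2.1)
print(adjusted)  # all values close to 2.0
```

In realistic settings the relationship is only locally linear, which is why larger tolerances become usable once the adjustment absorbs part of the approximation error.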
 In many areas of computational biology, the likelihood of a scientific model is intractable, typically because interesting models are highly complex. This hampers scientific progress in terms of iterative data acquisition, parameter inference, model checking and model refinement within a Bayesian framework. We provide a statistical interpretation to current developments in likelihood-free Bayesian inference that explicitly accounts for discrepancies between the model and the data, termed Approximate Bayesian Computation under model uncertainty (ABCµ) (1). We augment the likelihood of the data with unknown error terms that correspond to freely chosen checking functions, and describe possible Monte Carlo strategies for sampling from the associated joint posterior distribution without the need of evaluating the likelihood. We discuss the benefit of incorporating model diagnostics within an ABC framework, and demonstrate how this method diagnoses model mismatch and guides model refinement by contrasting three qualitative models of protein network evolution to the protein interaction datasets of Helicobacter pylori and Treponema pallidum. The presented methods will be useful in the initial stages of model and data exploration, and in particular to efficiently scrutinize several models for which the likelihood is intractable by direct inspection of their summary errors, prior to more formal analyses.
The core idea is that, for Gibbs random fields and in particular for Ising models, when comparing several neighbourhood structures, the computation of the posterior probabilities of the models under competition can be operated by likelihood-free simulation techniques (ABC). The turning point for this resolution is that, due to the specific structure of Gibbs random field distributions, there exists a sufficient statistic across models which allows for an exact (rather than approximate) simulation from the posterior probabilities of the models. Obviously, when the structures grow more complex, it becomes necessary to introduce a true ABC step with a tolerance threshold ε in order to avoid running the algorithm for too long. Our toy example shows that the accuracy of the approximation of the Bayes factor can be greatly improved by resorting to the original ABC approach, since it allows for the inclusion of many more simulations. In a biophysical application to the choice of a folding structure for two proteins, we also demonstrate that we can implement the ABC solution on realistic datasets and, in the examples processed there, that the Bayes factors allow for a ranking that more standard methods (FROST, TM-score) do not provide.
Recently Joyce and Marjoram ("Approximately sufficient statistics and Bayesian computation", Stat. Appl. Genet. Mol. Biol. 7(1):26, 2008) developed a sequential scheme for selecting the best subset of summary statistics to use in ABC, given a set of candidate summary statistics. Their approach was based on a notion of approximate sufficiency. We will report the results of our investigation seeking ways to improve on their scheme, using Kullback-Leibler divergence.
We will look at how simulation can be used to produce informative summary statistics within ABC. The issue will be investigated both theoretically and via simulation, including comparisons with examples of ABC taken from the literature.
Recently a group of techniques, variously called likelihood-free inference, or Approximate Bayesian Computation (ABC), have been quite widely applied in population genetics. These methods typically require the data to be compressed into summary statistics. In a hierarchical setting one may be interested both in hyper-parameters and parameters, and there may be very many of the latter - for example, in a genetic model, these may be parameters describing each of many loci or populations.  This poses a problem for ABC in that one then requires summary statistics for each locus, and,  if used naively, a consequent problem in conditional density estimation.  We develop a general method for addressing these problems efficiently, and we describe recent work in which the ABC method can be used to detect loci under local selection.
We present Approximate Bayesian Computation as a technique of inference that relies on stochastic simulations and non-parametric statistics. For both the original estimator of the posterior distribution based on kernel smoothing and a refined version of the estimator based on a linear adjustment, we give their asymptotic bias and variance. Additionally, we introduce an original estimator of the posterior distribution based on a quadratic adjustment and we show that its bias contains fewer terms than that of the estimator with linear adjustment. Although the estimators with adjustment are not universally superior to the estimator based on kernel smoothing, we find that they can achieve better performance when there is a nearly homoscedastic relationship between the summary statistics and the parameter. Finally, both the asymptotic results and numerical simulations emphasize the importance of the curse of dimensionality in Approximate Bayesian Computation.
The approximation error in ABC algorithms can be understood by the consideration of an additive error term, where the distribution of this error can be inferred from the choice of metric and acceptance kernel. Once we are aware of this we can begin to think more carefully about what model error we expect for our models, and consequently what metric, tolerance and summaries we would ideally use. There may also be the opportunity to rewrite some models so that sampling can be done by the ABC rejection step, thus raising the possibility of exact inference in some cases.
Approximate Bayesian inference on the basis of summary statistics is well-suited to complex problems for which the likelihood is either mathematically or computationally intractable. However, methods that use rejection suffer from the curse of dimensionality when the number of summary statistics is increased. Here we propose a machine-learning approach to the estimation of the posterior density by introducing two innovations. The new method fits a nonlinear conditional heteroscedastic regression model on the summary statistics using a penalized least-squares method, and then adaptively improves estimation using importance sampling. We also investigate the choice of the regularization parameter and of the tolerance rate in the ABC algorithm with a version of the Deviance Information Criterion. The new algorithm is compared to state-of-the-art approximate Bayesian methods, and achieves a considerable reduction of the computational burden in two examples of inference, in statistical genetics and in a queueing model.
DIYABC is a computer program with a graphical user interface and a fully clickable environment. It allows population biologists to make inferences based on Approximate Bayesian Computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIYABC can be used to compare competing scenarios, estimate parameters for one or more scenarios, and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data).

This definitely is a marathon schedule (!), but it should allow attendees from France or nearby countries to make the round trip within the same day (if needed).

This meeting is free, with no registration, and open to anyone interested. The talks will take place in Amphitheater 2-3 of Université Paris Dauphine, located on the second floor of the (unique) university building. Université Paris Dauphine is located in downtown Paris (Porte Dauphine) and is accessible by metro (e.g., the Porte Dauphine or Avenue Foch stops), as explained there.

Contact Christian Robert at bayesianstatistics[(à)] for further practical information (but the programme is now complete: no more talks, sorry!).
