\sc Ten Years, or so, of MCMC

Ten years, and still running!

Olivier Cappé and Christian P. Robert
ENST, Paris, and CREST-INSEE, Paris, and Université de Rouen
`cappe@sig.enst.fr robert@ensae.fr`

Draft 1.0 - Comments welcome!

Iteration 0: Burnin' steps

While MCMC techniques can be traced back to the early fifties and are almost contemporary with independent Monte Carlo methods, the use of these techniques has only really taken off in the past ten years, at least in ``mainstream" Statistics, with Gelfand and Smith (1990) and the re-discovery of Metropolis' and Hastings' papers. The impact on the discipline is deep and durable, because these methods have opened new horizons in the scale of the problems one can deal with, thus enhancing the position of Statistics in most applied fields. The MCMC ``revolution" has in particular boosted Bayesian Statistics to new heights by providing a virtually universal tool to deal with integration (and optimisation) problems: this can be seen, for instance, in the explosion of papers dealing with complex models, hierarchical modelings, nonparametric Bayesian estimation, or spatial Statistics. This trend has also created new synergies with mathematicians and probabilists, as well as econometricians, engineers, ecologists, astronomers and others, for, respectively, theoretical requests and practical implications of MCMC techniques.

It is impossible to provide here a complete perspective on MCMC methods and applications. For this as well as references, the reader should refer to the books by Gilks et al. (1995) and by Robert and Casella (1999). We focus on four particular aspects of MCMC algorithms: theoretical foundations, practical implementation, Bayesian inference and prospects.

Virtual aspects.

One can find on the Web many relevant sites, among which the MCMC preprint server,¹ which offers a most useful window to the current research on MCMC, Wilson's site on perfect simulation,² keeping track of papers and applets on this topic, and a site on convergence control techniques.³ Even more recent developments allow for on-line computations through applets, like Breyer's,⁴ which impressively illustrates the behaviour of different samplers for simple problems. Note also that the BUGS and CODA softwares are, so far, freely available on the Web.⁵

Iteration 1: It does converge!

The idea at the core of MCMC techniques is, a posteriori, quite simple: run a Markov chain (q^(t)) such that (a) (q^(t)) is irreducible, (b) (q^(t)) is ergodic, and (c) the stationary distribution of (q^(t)) is the distribution of interest, p. Once the chain is on the move, all we need to do is to check for convergence to this distribution, since this should theoretically happen by virtue of the ergodic theorem. Moreover, integrals of interest can be approximated as in independent Monte Carlo studies since

lim
T®¥

T
å
t = 1

h(q^(t)) =

ó
õ

h(q) dp(q) .

This follows from basic Markov chain theory, but the formidable appeal of MCMC methods is that there exist universal schemes to implement this idea: besides the Gibbs sampler (see xxx in this volume), the Metropolis-Hastings algorithm allows one to pick almost any conditional density q(·|q), and to derive a transition kernel (that is, the conditional density of the Markov chain) as

K(q^(t),q^(t+1))

q(q^(t+1)|q^(t))

min

æ
ç
è

1 ,

q(q^(t)|q^(t+1)) p(q^(t+1))

q(q^(t+1)|q^(t)) p(q^(t))

ö
÷
ø

d_q^_(t)(q^(t+1))

ó
õ

max

æ
ç
è

0 , 1 -

q(q^(t)|x) p(x)

q(x|q^(t)) p(q^(t))

ö
÷
ø

q(x|q^(t)) dx .

The distribution p is then stationary under the transition K since the (detailed) balance condition holds,

p(q^(t)) K(q^(t),q^(t+1)) = p(q^(t+1)) K(q^(t+1),q^(t)) ,

and the chain (q^(t)) is irreductible and ergodic in almost every setting.

It could converge faster...

In some cases, ergodicity may simply be not enough to guarantee convergence. The work in recent years has thus focussed on more advanced properties of MCMC samplers. For instance, a property of interest is geometric ergodicity, that is, the geometric decrease (in t) of the total variation distance between the distribution at time t and p. Mengersen and Tweedie (1996) established fairly complete conditions of geometric ergodicity when q(x|q) = g(x) [independent samplers] and q(x|q) = f(x-q) [random walk samplers]. Uniform ergodicity is an even stronger type of convergence, in the sense that the constant C involved in the geometric decrease C r^t (0 < r < 1) does not depend on the starting value q⁽⁰⁾. This property cannot hold for random walks on non-compact sets, in general, but can be exhibited for some Gibbs and independent samplers.

This type of research has practical bearings, moreover, since it allows to compare and rank samplers in specific cases (geometric vs. non-geometric for instance). In addition, a considerable amount of work has gone in the estimation of the geometric constants C and r. The fact that an exact rate is available for some models of interest is obviously quite a major advance, since it means that precise information is thus available on the time required by the algorithm to converge. From another point of view, it also allows to reject some sampling schemes. So, what appears at first as a theoretical game ends up after a few years and a considerable amount of work by Roberts and co-authors (Roberts and Rosenthal, 1998, 1999, Roberts and Tweedie, 1999, Rosenthal, 1994) as a valuable tool for practical implementation.

...by scaling the samplers, ...

Metropolis-Hastings algorithms, like their accept-reject counterparts, and unlike the Gibbs sampler, have an average probability r > 0 to reject the proposed values. Contrary to the accept-reject setting, though, a high value of r is not always a desirable feature for random walk samplers, since they indicate a propensity to avoid the tails of p. Gelman et al. (1996) have characterised more quantitatively this feature by deriving optimal acceptance rates, which should get close to 0.234 for high dimensional Gaussian models. Similar developments by Roberts and Rosenthal (1998) show that this limiting rate is close to 0.5 for Langevin algorithms.

...or taking advantage of the posterior, ...

The random walk proposal is somehow the obvious choice and is used in the early works of Metropolis and Hastings. Despite (or because of?) its universality, it does not always enjoy satisfactory convergence properties, in particular in large dimensions. Even though reparameterisation and optimal scaling of the proposal may often improve its performance, the random walk is somehow a shot in the dark.

An improvement that takes advantage of the fact that p is known up to a constant, while keeping the universality feature, is based on discretisation of the Langevin diffusion

dX_t = (1/2) Ñlogp(X_t) dt + dW_t ,

which is associated with the stationnary distribution p. The discretised version, which defines a new transition q, is x_t+1 = q^(t) + (s²/2) Ñlogp(q^(t)) + se_t . It thus adds a shift term (s²/2) Ñlogp(q^(t)) to the random walk, which ensures that the moves are more likely directed towards regions with high values of p, than towards zones of low values of p. This may be detrimental in multimodal setups, since it pulls the chain back towards the nearest mode, but it has been observed to often speed up mixing and excursions on the surface of p. This practical observation is also backed by theoretical developments on the superior convergence properties of Langevin MCMC algorithms. A last word about the universality of this scheme is that it requires very little additional programming, compared with random walk, since the computation of the gradient can be delegated to a numerical procedure, instead of being analytic. This allows for a more universal programming and also for transparent reparameterisation.

...or even jumping between dimensions.

For us, if one area must be singled out for its impact on Statistics, it is the ability to deal with variable dimension models, which appear quite naturally in, e.g., changepoint problems, latent variable models, but also in model choice and variable selection. While the Gibbs sampler cannot handle such models, unless one can provide an encompassing model, MCMC techniques have been devised to this purpose. The most popular solution is, by far, the reversible jump technique of Green (1995), where two models M₁ and M₂ are embedded in models, M₁^* and M₂^*, of identical dimension, so that a one-to-one mapping can be constructed between M₁^* and M₂^*. While this device requires a careful implementation, in particular because of the Jacobian of the transformation from M₁^* to M₂^*, it has been used successfully in a considerable number of settings. Alternatives based on jump diffusion or birth-and-death process have also been implemented in realistic settings.

Iteration 2: Does it run?

It should start on its own, ...

The main factor in the success of MCMC algorithms is that they can be implemented with little effort in a large variety of settings. This is obviously true of the Gibbs sampler which, provided some conditional distributions are available, simply runs by generating from these conditionals, as shown by the software BUGS. (The advent of the slice sampler (see xxx) even increases this automation.) So far, this is not exactly the case with MCMC samplers: there is no BUMH equivalent to BUGS, for instance! This is actually rather surprising, given that some Metropolis-Hastings algorithms allow for a very modular programming (Figure 1).This is to say, the prior, the likelihood and the Metropolis-Hastings method can be programmed as separate blocks (procedures). For sampling schemes⁰ like random walks or Langevin diffusions, the scale s of the perturbation can be calibrated against the acceptance rate¹, being thus computed automatically. Moreover, these settings allow for straightforward changes of parameterisation, through the definition of a Jacobian procedure². At last, general convergence assessment tools³ can be added to the blocks. This is not exactly fully automatised programming but at least gives a hint of what the equivalent of BUGS could be. Obviously, this global strategy excludes more specific samplers, like independent samplers, because they more strongly depend on the prior/likelihood, but it works reasonably well in problems of moderate dimensions.

prior

parameterisation₂

sampler₀

calibration₁

likelihood

convergence assessment₃

Figure 1 - Blocks used in modular programming

...and check for convergence, ...

The main difficulty with MCMC samplers, when compared with Monte Carlo techniques, is that their validity is strictly asymptotic: namely, q^(t) is converging in distribution to p. This difficulty is reflected in the impossibility of devising a general stopping rule for MCMC sampler and, correlatively, by the emergence in the last 10 years of many partial convergence criteria which focus on specific convergence issues. Several reviews have appeared in the past (see, e.g., Brooks and Roberts, 1999) and they give a rather exhaustive picture of the state of the art. A feature worth noticing is that the early debate parallel chains vs. unique chain has somehow died out and that most now agree on using a panel of signals which reflect the diversity of convergence issues and of uses of the MCMC sample. This strategy is somehow conservative, since the chain keeps going till the ``slowest" signal turns green, and bias results from using simultaneously several stopping rules. However, these rules are not used in practice with particular attention to significance level nor as exact stopping rules: one runs the sampler and checks every 1000 or 10,000 iterations for sufficient agreement in the stopping rules. A software like CODA (Best et al., 1995), which contains a series of general convergence diagnostics, actually works ex post, that is, once a batch of simulations has been done, and signals whether or not simulations should be continued.

...while improving on-line.

A given sampler will eventually fail to converge in a reasonable time for one particular problem! Solutions to this potential threat, well illustrated through pathological examples, are (a) to use several samplers simultaneously, in hybrid structures, (b) to modify temporarily the distribution to simulate, as in tempering, or (c) to adapt the proposal distribution on line, that is, to use the MCMC sample (or samples) up to iteration T to modify the proposal for iterations T+1 on. In this third case, caution is necessary to handle the heterogeneity of the chain thus generated but some advances have been made in this direction, establishing specific limit theorems, and using renewal theory and nonparametric density estimates.

Another fruitful direction of research covers ways of accelerating MCMC algorithms, that is, to improve excursions over the support of p and/or estimation of I E^p[h(q)]. Among these methods, let us mention simulated tempering (Neal, 1996b), simulated sintering (Liu and Sabati, 1999), particle filter, (Liu and Chen, 1998) control variates (Frigessi et al., 1999) and Rao-Blackwellisation (Robert and Casella, 1999).

Iteration 3: Boosting Bayesian inference

It is simply impossible to give a complete picture of the long-lasting effect of the seminal paper of Gelfand and Smith (1990)! We just mention here some areas of inference which strongly benefitted from MCMC techniques, stressing first that MCMC has clearly led to an upsurge in the use of hierarchical modelling as in graphical models, generalised linear models, image analysis, etc. This is also apparent in other vignettes in this serie, as in Cressie (1999, this volume).

Latent variable models.

The most obvious example of all, namely the mixture of distributions å_{i = 1}^k p_i f(x|x_i) , simply could not be processed properly before the advent of MCMC techniques. Since then, many other latent variable models, that is, distributions that are naturally expressed as marginals

p(q) =

ó
õ

p^* (q,z) dz ,

have been processed, thanks to MCMC techniques: hidden Markov and semi-Markov models, ARCH and switching ARMA models, stochastic volatility and discretized diffusion models, missing data and censored models, deconvolution models are some of those (Robert, 1998), with important consequences for econometrics, signal processing, telecommunication, finance, as shown for instance by the Applications section of JASA over the past years.

Model choice.

The introduction of reversible jump techniques has been beneficial to the area of model choice and variable selection as it allows for prior distributions on the dimension of a model as well as the parameter of each model. This possibility has led to a large amount of work on model choice, with new notions like model averaging, and a renewed interest in (instrinsic and fractional) Bayes factors (Key et al., 1999). In particular, the advances in graphical models are quite impressive, since we now see setups where the graph structures themselves can be simulated, hence allowing to draw inference on the connections between variables (see, e.g., Madigan and York, 1995). Furthermore, as this area of research allows MCMC computations to be performed locally, with considerable performance improvements, very large and structured models can now be tackled properly (Giudici and Green, 1998).

Nonparametric Bayes.

Another area which has changed very much in the last ten years is Bayesian nonparametrics. The tools for density estimation, and curve fitting, and nonparametric or semiparametric regression are now quite different and improved, as shown for instance in Dey et al. (1998). Alternatives to the Dirichlet prior, much criticised for its discrete features, have now appeared, either through the use of other processes (Beta, Levy) or of infinite mixtures. Reversible jump techniques provide an entry for density estimation via mixtures with an unknown number of components, using normal, beta or Bernstein densities. The use of wavelets and the estimation of the wavelet coefficients have also been clearly enhanced by MCMC techniques.

Iteration 4: What's the next step?

The development of MCMC techniques is far from over and we cannot predict what the next major step will be, either at the theoretical or at the implementation level! Nonetheless, we can point out a few emerging, likely or simply desirable, developments.

Perfect simulation.

While this area has been on the move since the seminal paper of Propp and Wilson (1996), who showed that it is possible to use an MCMC sampler to simulate exactly from the distribution of interest by the technique of coupling from the past, it is yet unclear to us whether this technique will be available for statistical models on a scale larger than for a restricted class of discrete models (besides point processes). Recent innovations like the multigamma coupler (Green and Murdoch, 1999) and the perfect slice sampler (Roberts and Mira, 1998), which reduce a continum of starting chains to a finite number of starting points, are steps in this direction, but the difficulty of implementation is considerable and we are still very far from automated versions.

Adaptive and sequential algorithms.

A likely development, both in theory and practice, is an increase in the study of heterogeneous Markov chains and auto-adaptative algorithms, as mentionned above. A related area of considerable potential deals with sequential algorithms, where the distribution p_n (and possibly the dimension of the parameter space) varies with time, as in, for instance, tracking. Developments of techniques like particle filters (Liu and Chen, 1998) and advanced importance sampling, which take avantage of the previous simulations to avoid re-running a sample from scratch, should be able to face more and more real-time applications, like those encountered in finance, image analysis, signal processing or artificial intelligence. Higher order algorithms, like Hamiltonian schemes, could also offer a better potential to deal with these difficult setups.

Large dimensions or datasets.

There is yet too little innovation to deal with models involving either a large number of parameters, as in econometric or graphical models, or huge datasets, as in finance, signal processing, or teletraffic, and the MCMC samplers are currently limited in terms of the models they can handle. Neural networks, which have been investigated by Neal (1996a), are a typical example of such settings, but techniques need be devised for more classical statistical models, where completion, for instance, is too costly to be implemented (see Mau and Newton, 1997, for an illustration in Genetics).

Software.

We mentionned BUGS and CODA as existing software dedicated to MCMC algorithms (more specifically to Gibbs sampling), but there is still much to be done to see MCMC as part of commercial software. This implies that each new problem requires some programming effort, most often in a basic langage like for Fortran or C, given that interpreted langages, like Splus or Gauss, have huge difficulties to deal with loops, which are inherent to Markov chain algorithms. An alternative would be to find ways of completely vectorizing an MCMC sampler, to allow for parallel programming, but there is little done in that direction, so far. New statistical languages, like the Omega project,⁶ based on JAVA, may offer a solution, if they take into account MCMC requirements.

References

Best, N.G., Cowles, M.K. and Vines, K. (1995) CODA: Convergence diagnosis and output analysis software for Gibbs sampling output, Version 0.30. Tech. Report, MRC Biostatistics Unit, Univ. of Cambridge.

Brooks, S.P and Roberts, G. (1999) Assessing convergence of Markov chain Monte Carlo algorithms. Statistics and Computing 8(4), 319-335,

Dey, D., Müller, P., and Sinha, D. (1998) Practical Nonparametric and Semiparametric Bayesian Statistics. Lecture Notes 133, Springer-Verlag, New York.

Frigessi, A., Gåsemyr, J., and Rue, H. (1999) Antithetic coupling of two Gibbs sampler chains. Tech. report, Norwegian Computing Center, Olso.

Gelfand, A.E. and Smith, A.F.M. (1990) Sampling based approaches to calculating marginal densities. J. American Statist. Assoc. 85, 398-409.

Gelman, A., Gilks, W.R. and Roberts, G.O. (1996) Efficient Metropolis jumping rules. In Bayesian Statistics 5, J.O. Berger, J.M. Bernardo, A.P. Dawid, D.V. Lindley and A.F.M. Smith (Eds.). 599-608. Oxford University Press, Oxford.

Gilks, W.R., Richardson, S. and Spiegelhalter, D.I. (1996) Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

Green, P.J. (1995) Reversible jump MCMC computation and Bayesian model determination. Biometrika 82(4), 711-732.

Green, P.J. and Murdoch, D. (1998) Exact sampling for Bayesian inference: towards general purpose algorithms. In Bayesian Statistics 6. J.O. Berger, J.M. Bernardo, A.P. Dawid, D.V. Lindley and A.F.M. Smith (Eds.). Oxford University Press, Oxford.

Giudici, P. and Green, P.J. (1998) Decomposable graphical Gaussian model determination. Technical Report, Università di Pavia.

Key, J., Perrichi, L., and Smith, A.F.M. (1999) Bayesian model choice: what and why? In Bayesian Statistics 6, J.O. Berger, J.M. Bernardo, A.P. Dawid, and A.F.M. Smith (Eds.). Oxford University Press, Oxford.

Liu, J.S., and Chen, R. (1998) Sequential Monte Carlo Methods for Dynamic Systems. J. American Statist. Assoc. 93, 1032-1044

Liu, J.S., and Sabbati, C. (1999) Simulated sintering: Markov Chain Monte Carlo with Spaces of Varying Dimensions. In Bayesian Statistics 6, J.O. Berger, J.M. Bernardo, A.P. Dawid, and A.F.M. Smith (Eds.). Oxford University Press, Oxford.

Madigan, D. and York, J. (1995) Bayesian graphical models for discrete data. International Statistical Review 63, 215-232.

Mau, B. and Newton, M.A. (1997) Phylogenetic Inference on Dendograms Using Markov Chain Monte Carlo Methods J. Comput. Graphical Statistics 6(1), 122-131.

Mengersen, K.L. and Tweedie, R.L. (1996) Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24, 101-121.

Mira, A. and Roberts, G.O. (1998) Perfect slice samplers. Università degli Studi dell'Insurbia, Varese.

Neal, R.M. (1996a) Bayesian Learning for Neural Networks. Lecture Notes 118, Springer-Verlag, New York.

Neal, R.M. (1996b) Sampling from multimodal distributions using tempered transitions. Statistics and Computing 4, 353-366.

Propp, J.G. and Wilson, D.B. (1996) Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms 9, 223-252.

Robert, C.P. (1998) MCMC specifics for latent variable models. In COMPSTAT 1998 (R. Payne and P.J. Green, eds.), 101-112. Physica-Verlag, Heidelberg.

Robert, C.P., and Casella, G. (1999) Monte Carlo Statistical Methods. Springer-Verlag, New York.

Roberts, G.O., and Rosenthal, J.S. (1998) Optimal scaling of discrete approximations to Langevin diffusions. J. Royal Statis. Soc. 60, 255-268.

Roberts, G.O., and Rosenthal, J.S. (1999) Convergence of slice sampler Markov chains. J. Royal Statis. Soc. 61, 643-660.

Roberts, G.O., and Tweedie, R. (1999) Bounds on regeneration times and convergence rates for Markov chains. Stoch. Proc. Applic., 80, 211-229.

Rosenthal J.S. (1995) Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. American Statist. Assoc., 90, 558-566.

Footnotes:

¹ http://www.stats.bris.ac.uk/MCMC/

² http://dimacs.rutgers.edu/ ~ dbwilson/exact.html

³ http://www.ensae.fr/crest/statistique/robert/McDiag

⁴ http://web.mat.uniroma3.it/users/breyer/java/

⁵ http://www.mrc-bsu.cam.ac.uk/bugs/Welcome.html

⁶ http://www.omegahat.org/

Thanks to Paolo Giudici and Peter Green for comments and suggestions!

File translated from T_EX by T_TH, version 2.00.
On 21 Jul 1999, 12:59.