Nested sampling (coms on coms)

Skilling's reply (with my comments)

Regarding your comments, I have a few initial reactions.

Right after writing our comments on nested sampling and sending it to the author, we received this email from John Skilling. While the overall bullying tone is fairly annoying, but can be ignored, there were a few points that deserved attention. In particular, I am still uncertain about the overall potential of the method, i.e. whether sampling within a more and more limited region of higher and higher likelihood can bring an improvement over regular sampling techniques. (In the paper, there is no sign of a measure of this improvement.) There is also the fact that we won't be granted the luxury of a reply to the reply. X

1. To say that you are "less confident than the author" invites a request for justification. The author has tried the method, but have you? For example, here's the start of another email that arrived this morning from a bright young colleague:

Hi John,
I know that this praise is late in coming, but your nested sampling algorithm is great!!!
I wrote up a Matlab version based on your C code and it works so nicely.
I was wondering how this can best be parallelized....

Makes me feel really sorry to be neither bright, nor young...! Sounds too much like a commercial, though!!!

Nested sampling might actually work as claimed, in which case you will find yourself in print but on the wrong side of history.

I did try the method and wrote a small R program on a toy example to this effect. As N goes to infinity, it indeed converges to the true value, for the same reason Rieman sums are converging to integrals. So overall I am indeed less confident that it is a valuable method for approximating marginals because it suffers from the same drawbacks as numerical integration, including the curse of dimensionality.

2. If you wish to describe my elementary justification of (4) as "convoluted", then you may do so (both Z's are upper-case, by the way). Other readers, perhaps less sophisticated than yourselves, might be grateful for the simplicity of evaluating an integral in any order, rather than diverting through expectations and complementary cumulative distribution functions. Professional statisticians may prefer their specialised jargon, but my judgment is that a general idea like nested sampling deserves a wider readership.

Overlooking the condescending tone of the sentence, I still do not see how one can do simpler than using a trapezoidal or step-function approximation to an integral as given at the top of page 2 of our discussion. See also Figure 3 in the paper. The nested sampling method sums up as the use of a step-function with points generated from the prior restricted to higher and higher regions of the likelihood.

3. Using the same notation for L(theta) and L(X) is merely what's done in computing all the time, where functions are routinely overloaded according to the type of argument. We have long grown past the time when the programmer had to specify different function names, such as SQRT, DSQRT, CSQRT and so on, for single precision, double precision, complex etc. argument. We just say "sqrt", without ambiguity. Likewise, I could have suffixed L_X(X) and L_theta(theta), but chose not to do so. Fussy notation like that is trendy, but it carries no information and obscures the message of the equations. It's even counter-productive. For example, in an application involving lengths, I might legitimately use L_x for a length function, but L_theta for a likelihood. It is then the reader's responsibility to keep the concepts apart. By overloading the symbol L, my references necessarily mean the same quantity each time. This puts the responsibility for coherence back where it belongs, on the author, and simplifies matters for the reader.

Difficult to argue against the fact that using the same notation for two similar but different objects in the same paragraph is not the same thing as using twice the same symbol in a computer program. Even more difficult to see that as trendy...

4. If your unbounded likelihood is integrable to finite evidence, that's fine by me. I don't then refer to it because I don't need to. If it's NOT integrable, so that the evidence is infinite, then I'm not interested because inference problems are not like that.

Infinite marginals are obviously of no interest from a Bayesian point of view, but for an infinite likelihood with finite marginal, I wonder about the convergence of the algorithm. This is not obvious for the trapezoidal approximation.

5. My claim is that nested-sampling exploration within a hard constraint is "... (IF ANYTHING) easier than the traditional Metropolis-Hastings...". You italicize the last part of the quote while omitting my qualifier (my capitals). By this selective quotation, you imply that my claim is stronger than I meant. I know perfectly well that there's very little difference, and I say so. As it happens, I spell this out more clearly in equation (32) (Metropolis-Hastings) and (33) (nested sampling), equivalently (18) and (19) of the revised paper. Either way, you are wrong to say that my claim is "certainly not" valid.

Either one needs to simulate directly from the prior under the likelihood constraint and this implies likelihood-weighting, or one uses a Metropolis step and then it is exactly Metropolis-Hastings.

6. I've tried to reference such work as lies on the road to nested sampling. On bridge sampling, I refer to Gelman & Meng, but I do not intend to divert the reader into a broader corpus of material that goes in different directions. Reversible-jump can work if models are sufficiently similar, but not otherwise: it's not what I'm discussing, so I do not wish to refer to it.

The central motivation for the method in the paper is the approximation of the marginal (evidence), which in its turn is mostly useful for model comparison. So it would make sense to compare nested sampling with tools that have been around for quite a while for computing Bayes factors. Overall, I am a bit wary of giving a specific meaning to evidence per se : it is only the ratios of marginals that matter. (The revision of the paper stress this very point in its introduction.) Reversible-jump is designed for that purpose, visiting each model in proportion to its evidence, so I do not understand why this is not relevant... The dismissal of its efficiency when models are not sufficiently similar (meaning?) is rather cavalier, given ten years of evidence to the opposite in the literature.

7. Well, I think equation (4) is central, not useless. What matters is not whether the intervals DeltaX all converge to 0 (they don't, and I say so) but whether the integrand L is smooth enough (which, being decreasing, it is). I'm not really interested in large N, because my N is usually small. And none of us should need a (2004) reference to explain the simple Riemann integral that dates back to Newton and Leibnitz. That said, it would be worrying if large N didn't give the right asymptotic limit. But it's OK; it has to be. I have tried to address this, at least partially, on pages 12 and 13 of the revise.

I definitely do not wish to fight ownership with Newton and Leibniz. Simply, using a mix of simulation and of Riemann integration has been around for quite a while in the literature, as for instance in Yakowitz et al. (1978) or Philippe (1997). The main interest of these mixed methods is that the error variance is decreasing faster than with usual Monte Carlo methods, the smoother the integrand, the faster the convergence actually. (Hence the reference to our book, Section 4.3.)
Another point worth mentioning is that convergence issues are rather speedily evacuated from the paper. Again, when running the above R program, the approximation does converge to the true value at a speed of O(sqrt(N)), as shown in the discussion of the paper by Raftery et al.

8. I'm not concerned with optimality, and how could an algorithm be proved optimal anyway? I merely point out that my choice of simulating theta from pi(theta) is possible. If you think you can do better, try it.

This is still a central issue in simulation for Bayesian statistics: simulating from the prior is not efficient when the likelihood is informative because this induces a waste of simulations in low likelihood regions. And yes, there exist optimality results within the Monte Carlo literature (as for instance the choice of importance functions that minimise the variance or some entropy measure).

9. Vague priors induce low evidence values. That's why a vague prior will be out-performed by any prior that expresses sensible knowledge about the problem in question --- the sort of decent order-of- magnitude judgment that any professional worker ought to have about his application. There's no merit in deliberate vagueness, and this message is reinforced by proper calculations which will give unnecessarily small Z.

10. Improper priors are, and always were, silly. They give Z=0. Is a user supposed to be so appallingly ignorant that he doesn't know AT ALL even the vaguest idea of the order-of-magnitude of what he's looking at? I've never met such a user, and I would refuse to pander to such perverse ignorance if I did.

This is a usual confusion on the superiority of proper vague priors over improper priors. Vague priors are no help at all in testing and model choice setups, as shown by the Lindley-Jeffreys paradox. Having a vague idea of the order-of-magnitude is not enough to provide a well-defined Bayes factor, unfortunately. See Berger (2000, vignette in JASA) for more discussion of this issue.

11. I'm sorry for Mira et al. (2001) if they have problems for which it's impossible to sample from L > constraint. I can't see how such a problem could arise, but if it does then I can't help them. I specialise in soluble problems.

It is most surprising to advance ignorance as an argument! Slice sampling is related to nested sampling and this paper of Mira, Moeller and Roberts is of course relevant because it exhibits the proper order associated with slice sampling, which is exactly the likelihood ordering. The authors thus build up a very generic slice sampler with monotonicity properties, but simulating from the slices is not always feasible.

12. Oh, come on! Get real. OF COURSE Fig 1, Fig 2 and the rest aren't to scale. I say very clearly that the bulk of the integral lies at exponentially small X. What am I supposed to do? Draw a graph that's far far thinner than a single molecule of ink?

13. MCMC doesn't have to offer any justification for a finite number of simulations. That's been done already, with the integral (4) approximated as a finite sum. Nested sampling just asks for a constrained new sample. It doesn't care how it was obtained. In practice, it's exponentially hopeless to start from the beginning each time (I agree with you there), so it makes sense to start from where one is, and use MCMC to approximate the next sample point. So MCMC is the obvious tool for implementing nested sampling.

14. I have no idea why you should think that any reader would just try a single iteration of MCMC and give up if it failed. I certainly don't suggest that.

Actually, using a single step of MCMC is just as valid as using T=1000 or T=1,000,000 iterations in this setting.

15. You say that adding a single value [= new point?] to the sample of N points seems inefficient. But your suggestion of getting a whole new generation of N points also looks inefficient, especially since N-1 of them are already available for free.

It really depends how one defines efficiency: if all of the N points are at rather low likelihood values (which may clearly happen given that one is simulating from the prior distribution), then using these N points does not quickly improve the approximation of the marginal. I agree that it is not clear why simulating N new points would improve the convergence, though, if those are still simulated from the truncated prior distribution.

16. Well of course multimodality is not treated fully. It couldn't be. No sampling method is EVER going to find the primary flagpole in the middle of the waves of a secondary ocean. That's not a comment on nested sampling --- it is a comment on sampling methods in general. If the primary mode is missed, it is (exponentially almost) certain that it will NOT be recovered later. Uncertainty hardly comes into it.

17. I can't describe how I sample X very much more clearly than I did
Xnext / Xthis = Uniform(0,1) equation (14)
though I have tried --- see equations (8) and (9) of the revise. If this is still too "fleeting", then please use your deeper understanding to suggest what I should say instead... Perhaps your misconception is that the X's are "attached" to the theta's. As far as I, the analyst, am concerned, they are NOT attached because I DO NOT KNOW how to do the sort. We're doing Bayesian inference here, within the algorithm. Sampling Xnext itself wouldn't make sense, but sampling the conditional (Xnext|Xthis) DOES make sense. It's the key step, and I have to say that your first paragraph of "Approximating the X's" seems to show that you've missed the point. I hope that my revise is even more clear about this. As I said in my talk, and repeat there, it's the difference between incompetence regarding X and ignorance about X that's critical.

I think that our integral representation of Z shows that we understand that the X_i's are attached to L_i rather than to a value of theta and that sampling theta's rather than directly the L's is a matter of convenience. I also agree that simulating the X_i's uniformly under the constraint is simular to simulating the theta's from pi under the corresponding constraint. An interesting point: simulating a new X_i conditional on L_i being larger than the previous L produces a value with the same distribution as the remaining X_j's.

18. If you object to my "confusing" notation, what am I to make of
           -1
     L = X (X )   ?
      i        i

19! I oppose any attempt to damage a Bayesian algorithm by a Rao-Blackwell fudge!

This again is a surprising argument given that algorithms are based on frequentist principles like the Law of Large Numbers! The name of Rao-Blackwellisation may be poorly chosen since it relates to a frequentist domination result, but the worth of algorithms is assessed in terms of speed of convergence and of variance reduction. Integrating out part of the randomness reduces the variance without damaging the algorithm in any objective respect.
In addition, those high Bayesian principles are soon forgotten in the derivation of the uncertainty on log(Z) in Section 2.4, which sounds rather frequentist, to say the least.

20. exp(-i/N) is not the expectation of Xi, and I never say it is. Try it. In large problems, sampling methods quite generally yield log(Z), not Z. So talk of expectations confuses the issue, which is another reason why I deliberately set up Z as an integral in equation (1), and did NOT use your initial formulation as an expectation. I want to point readers in a constructive direction.

We completely agree on that point that the expectation of X_i is not exp(-i/N). The approximation is still based on an expectation, on formula (18). Note that the expectation of X_i is known, since the t_i's are [conditionaly] Beta(N+1,1) distributed, as given in formula (17) of the paper; therefore, the [marginal] expectation of X_i is (N/N+1)ⁱ and this could be used as well. Note that formula (26) seems to be wrong since exp(-i/N) is an approximation of X_i, not of w_i.

21. I'm quite open about the fact that in a big problem with information H = 1000000 (say), it will take me 1000000 iterates to compress the domain by the required factor exp(1000000). I'm also open about the factor N slow-down needed to improve accuracy by sqrt(N). Can you do better?

As mentioned above, Yakowitz et al. (1978) found rates of convergence of order at least O(N) and higher for smoother functions L.

22. As for examples, my strategy is deliberately minimal. I already say what I want in my section on the thermal alternative. I was taught never to support a strong argument with a weak one. It only leads the reader to suppose that, in attacking and perhaps defeating the weak argument, he has defeated the conclusions. I'm confident that your "standard statistical examples" will work, but the discussion of them would become embedded in details of choice of example, exploration strategy, MCMC convergence, CPU comparisons and the like, which are nothing to do with the idea of nested sampling. I'm not getting into that trap. If you or anybody else wants to fairly assess such examples and write them up in useful comments, positive or negative, that's fine. Source code is freely available from the website in the Acknowledgments, and it is easy to program an application. Go ahead. But I do not wish to dilute the inner logic of this paper. That's my decision. I'm claiming the possibility of progress with a wider class of problems than has hitherto been solved. Reverting to already solved standard problems would dilute that message.

John Skilling

Last updated, June 29, 2006