Skilling's reply (with my comments)
Regarding your comments, I have a few initial reactions.
|Right after writing our comments on nested sampling and
sending them to the author, we received this email from John Skilling.
While the overall bullying tone is fairly annoying (if easily ignored),
there were a few points that deserved attention. In particular, I am
still uncertain about the overall potential of the method, i.e.
whether sampling within a more and more limited region of higher and
higher likelihood can bring an improvement over regular sampling
techniques. (In the paper, there is no sign of a measure of this
improvement.) There is also the fact that we won't be granted the
luxury of a reply to the reply.
1. To say that you are "less
confident than the author" invites a request for justification.
The author has tried the method, but have you? For example,
here's the start of another email that arrived this morning from a
bright young colleague:
I know that this praise is late in coming, but your nested sampling algorithm is great!!!
I wrote up a Matlab version based on your C code and it works so nicely.
I was wondering how this can best be parallelized....
|Makes me feel really sorry to be neither bright nor young...! Sounds too much like a commercial, though!!!
Nested sampling might actually work as claimed, in which case you will find yourself in print but on the wrong side of history.
|I did try the method and wrote a small R program on a toy example to this effect. As N
goes to infinity, it indeed converges to the true value, for the same
reason Riemann sums converge to integrals. So overall I am indeed less confident
that it is a valuable method for approximating marginals because it
suffers from the same drawbacks as numerical integration, including the
curse of dimensionality.
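For readers who want to reproduce this kind of check, here is a minimal Python sketch of the experiment (my original was in R); the Uniform(0,1) prior and the likelihood L(theta) = 2*theta are toy choices of mine, picked so that the true evidence Z = 1 is known exactly and so that sampling from the constrained prior is exact:

```python
import math
import random

def nested_sampling(N, iters, seed=0):
    """Toy nested sampling run: prior Uniform(0,1), likelihood L(theta) = 2*theta,
    so the true evidence is Z = int_0^1 2*theta dtheta = 1."""
    rng = random.Random(seed)
    live = [rng.random() for _ in range(N)]      # N live points drawn from the prior
    Z, X_prev = 0.0, 1.0
    for i in range(1, iters + 1):
        worst = min(live)                        # L = 2*theta is increasing, so min(theta) has min(L)
        X_i = math.exp(-i / N)                   # deterministic prior-mass shrinkage
        Z += (X_prev - X_i) * 2 * worst          # step-function (Riemann) increment
        X_prev = X_i
        # exact constrained-prior draw: Uniform(worst, 1), since the prior is flat
        live[live.index(worst)] = worst + (1 - worst) * rng.random()
    Z += X_prev * 2 * sum(live) / N              # contribution of the remaining live points
    return Z

print(nested_sampling(N=100, iters=1000))        # close to the true value Z = 1
```

With N = 100 live points the estimate lands within a few percent of Z = 1, and the spread over repeated runs shrinks as N grows, consistent with the Riemann-sum reading above.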
2. If you wish to describe my elementary justification of (4) as
"convoluted", then you may do so (both Z's are upper-case, by the
way). Other readers, perhaps less sophisticated than yourselves,
might be grateful for the simplicity of evaluating an integral in any
order, rather than diverting through expectations and complementary
cumulative distribution functions. Professional statisticians may
prefer their specialised jargon, but my judgment is that a general idea
like nested sampling deserves a wider readership.
|Overlooking the condescending tone of the sentence, I still
do not see how one can be simpler than using a trapezoidal or
step-function approximation to an integral as given at the top of
page 2 of our discussion. See also Figure 3 in the paper. The nested
sampling method amounts to using a step-function with points
generated from the prior restricted to regions of higher and higher likelihood.
3. Using the same notation for L(theta) and L(X) is merely what's done
in computing all the time, where functions are routinely overloaded
according to the type of argument. We have long grown past the
time when the programmer had to specify different function names, such
as SQRT, DSQRT, CSQRT and so on, for single precision, double
precision, complex etc. argument. We just say "sqrt", without
ambiguity. Likewise, I could have suffixed L_X(X) and
L_theta(theta), but chose not to do so. Fussy notation like that
is trendy, but it carries no information and obscures the message of
the equations. It's even counter-productive. For example,
in an application involving lengths, I might legitimately use L_x for a
length function, but L_theta for a likelihood. It is then the
reader's responsibility to keep the concepts apart. By
overloading the symbol L, my references necessarily mean the same
quantity each time. This puts the responsibility for coherence
back where it belongs, on the author, and simplifies matters for the reader.
|Difficult to deny, however, that using the same notation for two similar but different objects in the same paragraph is not the same thing as overloading a symbol in a computer program. Even more difficult to see mathematical notation as trendy...
4. If your unbounded likelihood is integrable to finite evidence,
that's fine by me. I don't then refer to it because I don't need
to. If it's NOT integrable, so that the evidence is infinite,
then I'm not interested because inference problems are not like that.
|Infinite marginals are obviously of no interest from a Bayesian point of view, but for
an infinite likelihood with finite marginal, I wonder about the
convergence of the algorithm. This is not obvious for the trapezoidal approximation.
5. My claim is that nested-sampling exploration within a hard
constraint is "... (IF ANYTHING) easier than the traditional
Metropolis-Hastings...". You italicize the last part of the
quote while omitting my qualifier (my capitals). By this
selective quotation, you imply that my claim is stronger than I
meant. I know perfectly well that there's very little difference,
and I say so. As it happens, I spell this out more clearly in
equation (32) (Metropolis-Hastings) and (33) (nested sampling),
equivalently (18) and (19) of the revised paper. Either way, you
are wrong to say that my claim is "certainly not" valid.
|Either one needs to simulate directly from the prior under the likelihood constraint and this implies likelihood-weighting, or one uses a Metropolis step and then it is exactly Metropolis-Hastings.
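To make the dichotomy concrete, here is a minimal Python sketch (my own toy setup, not Skilling's code) of the second alternative: a single Metropolis update targeting the prior restricted by the likelihood constraint. With a flat prior the Metropolis ratio is 1, so the move reduces to an accept/reject against the constrained set, which is exactly a Metropolis-Hastings step:

```python
import random

def constrained_prior_step(theta, L, L_star, rng, scale=0.1):
    """One Metropolis update targeting the Uniform(0,1) prior restricted to
    {theta : L(theta) > L_star}. With a flat prior the Metropolis ratio is 1,
    so the proposal is accepted iff it stays inside the constrained set."""
    prop = theta + scale * (2.0 * rng.random() - 1.0)   # symmetric random walk
    if 0.0 < prop < 1.0 and L(prop) > L_star:
        return prop          # accepted
    return theta             # rejected: the chain stays put

rng = random.Random(1)
L = lambda t: 2.0 * t        # toy likelihood; the constraint L > 1 means theta > 0.5
theta = 0.9                  # a current point that already satisfies the constraint
for _ in range(50):          # a short constrained chain
    theta = constrained_prior_step(theta, L, L_star=1.0, rng=rng)
print(theta)                 # the chain never leaves {theta > 0.5}
```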
6. I've tried to reference such work as lies on the road to nested
sampling. On bridge sampling, I refer to Gelman & Meng, but I
do not intend to divert the reader into a broader corpus of material
that goes in different directions. Reversible-jump can work if
models are sufficiently similar, but not otherwise: it's not what I'm
discussing, so I do not wish to refer to it.
|The central motivation for the method in the paper is the approximation of the marginal (evidence),
which in turn is mostly useful for model comparison. So it would
make sense to compare nested sampling with tools that have been
around for quite a while for computing Bayes factors. Overall, I am a
bit wary of giving a specific meaning to the evidence per se: it is only
the ratios of marginals that matter. (The revision of the
paper stresses this very point in its introduction.) Reversible-jump is
designed for exactly that purpose, visiting each model in proportion to its evidence, so I do not understand why it is not relevant... The dismissal of its efficiency when models are not "sufficiently similar" (meaning what?) is rather cavalier, given ten years of evidence to the contrary in the literature.
7. Well, I think equation (4) is central, not useless. What
matters is not whether the intervals DeltaX all converge to 0 (they
don't, and I say so) but whether the integrand L is smooth enough
(which, being decreasing, it is). I'm not really interested in
large N, because my N is usually small. And none of us should
need a (2004) reference to explain the simple Riemann integral that
dates back to Newton and Leibniz. That said, it would be
worrying if large N didn't give the right asymptotic limit. But
it's OK; it has to be. I have tried to address this, at least
partially, on pages 12 and 13 of the revise.
|I definitely do not wish to fight ownership with Newton and
Leibniz. Simply, using a mix of simulation and of Riemann integration
has been around for quite a while in the literature, as for instance in
Yakowitz et al. (1978) or Philippe (1997). The main interest of these
mixed methods is that the error variance decreases faster than with
usual Monte Carlo methods: the smoother the integrand, the faster the
convergence. (Hence the reference to our book, Section 4.3.)
Another point worth mentioning is that convergence issues are
dismissed rather quickly in the paper. Again, when running the above R
program, the approximation does converge to the true value, with an error
of order O(1/sqrt(N)), as shown in the discussion of the paper by Raftery et al.
8. I'm not concerned with optimality, and how could an algorithm be
proved optimal anyway? I merely point out that my choice of
simulating theta from pi(theta) is possible. If you think you can
do better, try it.
|This is still a central issue in simulation for Bayesian
statistics: simulating from the prior is not efficient when the
likelihood is informative because this induces a waste of simulations
in low likelihood regions. And yes, there exist optimality results
within the Monte Carlo literature (as for instance the choice of
importance functions that minimise the variance or some entropy criterion).
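As an illustration of such optimality results (a toy example of my own, not taken from the paper), compare plain prior sampling with an importance function chosen proportional to L times pi, which is the variance-minimising choice for estimating the evidence Z. With a Uniform(0,1) prior and a likelihood peaked at 0.5, a Normal(0.5, 0.1) importance density is exactly proportional to L*pi on (0,1), so the importance weights are (nearly) constant:

```python
import math
import random
import statistics

rng = random.Random(2)
L = lambda t: math.exp(-50 * (t - 0.5) ** 2)    # peaked likelihood, Uniform(0,1) prior

def prior_estimate(n):
    # naive Monte Carlo: average L over prior draws; most draws fall where L ~ 0
    return sum(L(rng.random()) for _ in range(n)) / n

def importance_estimate(n):
    # importance density g = Normal(0.5, 0.1), proportional to L * pi on (0,1):
    # the weights L(t)/g(t) are then (nearly) constant, the zero-variance ideal
    s = 0.0
    for _ in range(n):
        t = rng.gauss(0.5, 0.1)
        if 0.0 < t < 1.0:
            g = math.exp(-((t - 0.5) ** 2) / 0.02) / (0.1 * math.sqrt(2 * math.pi))
            s += L(t) / g
    return s / n

prior_reps = [prior_estimate(200) for _ in range(100)]
is_reps = [importance_estimate(200) for _ in range(100)]
print(statistics.stdev(prior_reps), statistics.stdev(is_reps))  # the second is far smaller
```

Both estimators are unbiased for Z, but the variance of the matched importance sampler is orders of magnitude smaller than that of prior sampling, which is precisely the sense in which simulating from the prior is suboptimal when the likelihood is informative.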
9. Vague priors induce low evidence values. That's why a vague
prior will be out-performed by any prior that expresses sensible
knowledge about the problem in question --- the sort of decent
order-of-magnitude judgment that any professional worker ought to have
about his application. There's no merit in deliberate vagueness, and
this message is reinforced by proper calculations which will give
unnecessarily small Z.
10. Improper priors are, and always were, silly. They give
Z=0. Is a user supposed to be so appallingly ignorant that he
doesn't know AT ALL even the vaguest idea of the order-of-magnitude of
what he's looking at? I've never met such a user, and I would refuse to
pander to such perverse ignorance if I did.
|This reflects the usual confusion about the superiority of proper
vague priors over improper priors. Vague priors are no help at all in
testing and model choice setups, as shown by the Lindley-Jeffreys
paradox. Having a vague idea of the order-of-magnitude is not
enough to provide a well-defined Bayes factor, unfortunately. See
Berger (2000, vignette in JASA) for more discussion of this issue.
11. I'm sorry for Mira et al. (2001) if they have problems for which
it's impossible to sample from L > constraint. I can't see how
such a problem could arise, but if it does then I can't help
them. I specialise in soluble problems.
|It is most surprising to advance ignorance as an argument! Slice
sampling is related to nested sampling and this paper of Mira, Moeller
and Roberts is of course relevant because it exhibits the proper order
associated with slice sampling, which is exactly the likelihood
ordering. The authors thus build up a very generic slice sampler with
monotonicity properties, but simulating from the slices is not always feasible.
12. Oh, come on! Get real. OF COURSE Fig 1, Fig 2 and the
rest aren't to scale. I say very clearly that the bulk of the
integral lies at exponentially small X. What am I supposed to
do? Draw a graph that's far far thinner than a single molecule of ink?
13. MCMC doesn't have to offer any justification for a finite number of
simulations. That's been done already, with the integral (4)
approximated as a finite sum. Nested sampling just asks for a
constrained new sample. It doesn't care how it was obtained. In
practice, it's exponentially hopeless to start from the beginning each
time (I agree with you there), so it makes sense to start from where
one is, and use MCMC to approximate the next sample point. So
MCMC is the obvious tool for implementing nested sampling.
14. I have no idea why you should think that any reader would just try
a single iteration of MCMC and give up if it failed. I certainly
don't suggest that.
|Actually, using a single step of MCMC is just as valid as using T=1000 or T=1,000,000 iterations in this setting.
15. You say that adding a single value [= new point?] to the sample of
N points seems inefficient. But your suggestion of getting a
whole new generation of N points also looks inefficient, especially
since N-1 of them are already available for free.
|It really depends how one defines efficiency: if all of
the N points are at rather low likelihood values (which may clearly
happen given that one is simulating from the prior distribution), then
using these N points does not quickly improve the approximation of the
marginal. I agree, though, that it is not clear why simulating N new points would improve convergence if those are still simulated from the truncated prior distribution.
16. Well of course multimodality is not treated fully. It
couldn't be. No sampling method is EVER going to find the primary
flagpole in the middle of the waves of a secondary ocean. That's
not a comment on nested sampling --- it is a comment on sampling
methods in general. If the primary mode is missed, it is
(exponentially almost) certain that it will NOT be recovered
later. Uncertainty hardly comes into it.
17. I can't describe how I sample X very much more clearly than I did
Xnext / Xthis ~ Uniform(0,1)        equation (14)
though I have tried --- see equations (8) and (9) of the revise.
If this is still too "fleeting", then please use your deeper
understanding to suggest what I should say instead... Perhaps
your misconception is that the X's are "attached" to the theta's.
As far as I, the analyst, am concerned, they are NOT attached because I
DO NOT KNOW how to do the sort. We're doing Bayesian inference
here, within the algorithm. Sampling Xnext itself wouldn't make
sense, but sampling the conditional (Xnext|Xthis) DOES make
sense. It's the key step, and I have to say that your first
paragraph of "Approximating the X's" seems to show that you've missed
the point. I hope that my revise is even more clear about this.
As I said in my talk, and repeat there, it's the difference between
incompetence regarding X and ignorance about X that's critical.
|I think that our integral representation of Z shows that
we understand that the Xi's are attached to Li rather
than to a value of theta and that sampling theta's rather than directly
the L's is a matter of convenience. I also agree that simulating the Xi's uniformly under the constraint is similar to simulating the
theta's from pi under the corresponding constraint. An interesting
point: simulating a new Xi conditional on Li being larger than the
previous L produces a value with the same distribution as the remaining Xj's.
18. If you object to my "confusing" notation, what am I to make of
L = X^{-1}(X) ?
19. I oppose any attempt to damage a Bayesian algorithm by a Rao-Blackwell fudge!
|This again is a surprising argument, given that simulation
algorithms are validated via frequentist principles like the Law of
Large Numbers! The name of Rao-Blackwellisation may be poorly chosen
since it relates to a frequentist domination result, but the worth of
algorithms is assessed in terms of speed of convergence and of variance
reduction. Integrating out part of the randomness reduces the variance without damaging the algorithm in any objective respect.
In addition, those high Bayesian principles are soon forgotten in the
derivation of the uncertainty on log(Z) in Section 2.4, which sounds
rather frequentist, to say the least.
20. exp(-i/N) is not the expectation of Xi, and I never say it is. Try
it. In large problems, sampling methods quite generally yield
log(Z), not Z. So talk of expectations confuses the issue, which
is another reason why I deliberately set up Z as an integral in
equation (1), and did NOT use your initial formulation as an
expectation. I want to point readers in a constructive direction.
|We completely agree that the expectation of Xi
is not exp(-i/N). The approximation is still based on an expectation,
via formula (18). Note that the expectation of Xi
is known, since the ti's are [conditionally] Beta(N,1) distributed,
as given in formula (17) of the paper; therefore, the [marginal]
expectation of Xi is (N/(N+1))^i and this could be used as well. Note that formula (26) seems to be wrong, since exp(-i/N) is an approximation of Xi, not of wi.
21. I'm quite open about the fact that in a big problem with
information H = 1000000 (say), it will take me 1000000 iterates to
compress the domain by the required factor exp(1000000). I'm also
open about the factor N slow-down needed to improve accuracy by
sqrt(N). Can you do better?
|As mentioned above, Yakowitz et al. (1978) found convergence rates of
order at least O(1/N) for the error, and even faster rates for smoother functions.
22. As for examples, my strategy is deliberately minimal. I
already say what I want in my section on the thermal alternative. I was
taught never to support a strong argument with a weak one. It
only leads the reader to suppose that, in attacking and perhaps
defeating the weak argument, he has defeated the conclusions. I'm confident that your "standard statistical
examples" will work, but the discussion of them would become embedded
in details of choice of example, exploration strategy, MCMC
convergence, CPU comparisons and the like, which are nothing to do with
the idea of nested sampling. I'm not getting into that
trap. If you or anybody else wants to fairly assess such examples
and write them up in useful comments, positive or negative, that's
fine. Source code is freely available from the website in the Acknowledgments, and it is
easy to program an application. Go ahead. But I do not wish
to dilute the inner logic of this paper. That's my
decision. I'm claiming the possibility of progress with a wider
class of problems than has hitherto been solved. Reverting to already solved standard problems would dilute that message.
Last updated, June 29, 2006