4.1. Introduction to MCMC and the Bayesian method

Users familiar with the Markov chain Monte Carlo (MCMC) method may want to skip to the next section. The typical problem this program addresses is parameter estimation: constraining the parameters of a given theoretical model with one or more sets of observational data. This is a very common task in cosmology today, especially given the growing volume and quality of data from several surveys. Important discoveries are expected from the data of new-generation telescopes in the next decade.

In the following I give a very brief introduction to the MCMC technique and describe how to use the program.

The Bayes Theorem

Bayesian inference is based on the inversion of the data-parameters probability relation, which is Bayes' theorem [1]. This theorem states that the posterior probability \(P(\theta \mid D, \mathcal{M})\) of the parameter set \(\theta\), given the data \(D\) and other information from the model \(\mathcal{M}\), is given by

\[P(\theta \mid D, \mathcal{M}) = \frac{\mathcal{L}(D \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M})}{P(D \mid \mathcal{M})},\]

where \(\mathcal{L}(D \mid \theta, \mathcal{M})\) is the likelihood of the data given the model parameters, \(\pi(\theta \mid \mathcal{M})\) is the prior probability, containing any information known a priori about the distribution of the parameters, and \(P(D \mid \mathcal{M})\) is the marginal likelihood, also popularly known as the evidence, which normalizes the posterior probability. The evidence is not required for parameter inference but is essential in model selection problems, where two or more different models are compared to see which of them is favored by the data.
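As a reminder of why the evidence acts as a normalization, it is the numerator above integrated over the whole parameter space (a standard identity, stated here in the notation of this section):

\[\text{evidence} = \int \mathcal{L}(D \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M}) \, \mathrm{d}\theta .\]

Because this integral does not depend on \(\theta\), dropping it rescales the posterior without moving its peaks.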

Direct evaluation of \(P(\theta \mid D, \mathcal{M})\) over the entire parameter space generally involves a difficult multidimensional integration that we do not know how to perform. Usually we do know how to compute the likelihood \(\mathcal{L}(D \mid \theta, \mathcal{M})\) that is assigned to the experiment (most commonly a distribution that is Gaussian in the data or the parameters), hence the use of Bayes' theorem to obtain the posterior probability. Flat priors are commonly assumed, in which case the numerator of the right-hand side of the equation above is simply proportional to the likelihood. Recall that the evidence is a normalization constant, which is not needed to learn about the most likely values of the parameters.
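To make this concrete, here is a minimal sketch of an unnormalized log-posterior built from a Gaussian likelihood and a flat prior. The straight-line model, the function names and the synthetic data are all assumptions made for illustration; they are not taken from the program.

```python
import numpy as np

def log_likelihood(theta, x, y, sigma):
    """Gaussian log-likelihood for a hypothetical model y = a*x + b."""
    a, b = theta
    residuals = (y - (a * x + b)) / sigma
    return -0.5 * np.sum(residuals**2)  # additive constants are dropped

def log_flat_prior(theta, lower, upper):
    """Flat prior: constant (0 in log) inside the box, -inf outside."""
    theta = np.asarray(theta, dtype=float)
    inside = np.all((theta >= lower) & (theta <= upper))
    return 0.0 if inside else -np.inf

def log_posterior(theta, x, y, sigma, lower, upper):
    """Unnormalized log-posterior: log prior + log likelihood (no evidence)."""
    lp = log_flat_prior(theta, lower, upper)
    if not np.isfinite(lp):
        return -np.inf  # outside the prior range, skip the likelihood
    return lp + log_likelihood(theta, x, y, sigma)

# Synthetic, noise-free data generated with a = 2, b = 1 (illustrative only).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0
sigma = 0.1
lower = np.array([0.0, 0.0])
upper = np.array([5.0, 5.0])

at_truth = log_posterior([2.0, 1.0], x, y, sigma, lower, upper)
off_truth = log_posterior([3.0, 1.0], x, y, sigma, lower, upper)
outside = log_posterior([6.0, 1.0], x, y, sigma, lower, upper)
```

Working with logarithms avoids numerical underflow, and the evidence never appears because only differences of log-posterior values matter for sampling.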

The Metropolis-Hastings sampler

The MCMC method replaces the problem of calculating the unknown posterior probability distribution over the entire parameter space, which can be extremely expensive for models with a large number of parameters, with the problem of sampling from the posterior distribution. This is possible, for example, by growing a Markov chain with new states generated by the Metropolis sampler [2].

A Markov chain has the property that every new state depends on the current state, and only on the current state. Dependence on earlier states or on statistics involving all states is not allowed. Such dependence can be introduced, and can even be useful for purposes like ours, but then the chain can no longer be called Markovian.

The standard MCMC consists of generating a random state \(y\) according to a proposal probability \(Q({} \cdot \mid x_t)\) given the current state \(x_t\) at time \(t\). Then a random number \(u\) is drawn from a uniform distribution between 0 and 1. The new state is accepted if \(r \ge u\), where

\[r = \min \left[1, \frac{P(y \mid D, \mathcal{M}) Q(x_t \mid y)}{P(x_t \mid D, \mathcal{M}) Q(y \mid x_t)} \right].\]

The fraction is the Metropolis-Hastings ratio. When the proposal function is symmetric, \(\frac{Q(x_t \mid y)}{Q(y \mid x_t)}\) reduces to 1 and the ratio is just the original Metropolis ratio of the posteriors. If the new state is accepted, we set \(x_{t+1} := y\); otherwise we repeat the current state in the chain by setting \(x_{t+1} := x_t\).
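The update rule above can be sketched in a few lines of Python. This is an illustrative random-walk Metropolis sampler with a symmetric Gaussian proposal, not the program's actual implementation; the function name, its arguments and the toy target (a standard two-dimensional Gaussian) are assumptions.

```python
import numpy as np

def metropolis_hastings(log_post, x0, step, n_steps, rng):
    """Random-walk Metropolis sampler (symmetric proposal, so Q cancels).

    log_post must return the unnormalized log-posterior and be finite at x0.
    """
    x = np.asarray(x0, dtype=float)
    logp = log_post(x)
    chain = np.empty((n_steps, x.size))
    n_accepted = 0
    for t in range(n_steps):
        # Candidate state y drawn from the proposal Q(. | x_t)
        y = x + step * rng.standard_normal(x.size)
        logp_y = log_post(y)
        # log r = min(0, log P(y) - log P(x_t)); the evidence cancels here
        log_r = min(0.0, logp_y - logp)
        # Accept when u < r (differs from r >= u only on a zero-probability tie)
        if np.log(rng.uniform()) < log_r:
            x, logp = y, logp_y
            n_accepted += 1
        chain[t] = x  # a rejected proposal repeats the current state
    return chain, n_accepted / n_steps

# Toy target: standard Gaussian in two dimensions (illustrative only).
def toy_log_post(theta):
    return -0.5 * np.sum(theta**2)

rng = np.random.default_rng(0)
chain, rate = metropolis_hastings(toy_log_post, [0.0, 0.0], 1.0, 20000, rng)
```

For an asymmetric proposal, the term \(\log Q(x_t \mid y) - \log Q(y \mid x_t)\) would have to be added to the difference of log-posteriors before taking the minimum.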

The acceptance rate \(\alpha = \frac{\text{number of accepted states}}{\text{total number of states}}\) of a chain is a useful diagnostic of efficiency. A commonly quoted target is about 0.234, the asymptotically optimal value for random-walk Metropolis proposals in high-dimensional problems [3]. A rate close to this can be obtained by tuning the parameters of the proposal function \(Q\). In this implementation, I use a multivariate Gaussian proposal distribution with a diagonal covariance matrix \(S\).
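The effect of the proposal width on the acceptance rate can be seen with a small experiment. The sketch below assumes a toy posterior (a standard Gaussian in 10 dimensions) and a proposal covariance \(S\) equal to `step**2` times the identity; the function name and the chosen step sizes are illustrative, not part of the program.

```python
import numpy as np

def acceptance_rate(step, dim=10, n_steps=5000, seed=0):
    """Acceptance rate of random-walk Metropolis on a toy Gaussian posterior."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    logp = -0.5 * x @ x
    accepted = 0
    for _ in range(n_steps):
        # Gaussian proposal with diagonal covariance S = step**2 * identity
        y = x + step * rng.standard_normal(dim)
        logp_y = -0.5 * y @ y
        # Same test as comparing u with r = min(1, ratio), written in logs
        if np.log(rng.uniform()) < logp_y - logp:
            x, logp = y, logp_y
            accepted += 1
    return accepted / n_steps

# Small steps are almost always accepted; large steps are almost always rejected.
rates = {step: acceptance_rate(step) for step in (0.1, 0.75, 3.0)}
for step, rate in rates.items():
    print(f"step = {step}: acceptance rate = {rate:.3f}")
```

Neither extreme explores the posterior efficiently: tiny steps are accepted but barely move the chain, while large steps are rarely accepted, so the chain repeats the same state many times.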

The Parallel Tempering algorithm (removed in this version)

Standard MCMC is powerful and works in most cases, but there are some problems where the user may be better off using another method. Due to the characteristic behavior of a Markov chain, it is possible (and even likely) that a chain becomes stuck in a single mode of a multimodal distribution. If two or more peaks are far away from each other, a proposal function tuned for good performance in one peak may have difficulty escaping that peak to explore the others, because the jumps may be too short. To overcome this inefficiency, a neat variation of MCMC called Parallel Tempering [4] favors a better exploration of the entire parameter space in such cases, thanks to an arrangement of multiple chains that are run in parallel, each one with a different "temperature" \(T\). The posterior of each chain is calculated as \(\mathcal{L}^{\beta} \pi\), with \(\beta = 1/T\). The first chain, with \(T = 1\), is the one that corresponds to the real posterior we are interested in; the other chains, at higher temperatures, have wider distributions, which makes it easier to jump between peaks and thus to explore the parameter space more thoroughly. Periodically, a swap of states between neighboring chains is proposed and accepted or rejected according to a Hastings-like ratio.
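Since this algorithm is no longer part of the program, the following is only an illustrative sketch of the swap step, not the removed code. It assumes the usual form of the swap ratio for tempered posteriors \(\mathcal{L}^{\beta} \pi\), in which the priors cancel and only the likelihoods of the two states enter; all names are hypothetical.

```python
import numpy as np

def swap_log_ratio(log_like_i, log_like_j, beta_i, beta_j):
    """Log of the swap ratio between chains i and j.

    With tempered posteriors L**beta * prior, the priors cancel and the
    ratio reduces to (L_j / L_i) ** (beta_i - beta_j).
    """
    return (beta_i - beta_j) * (log_like_j - log_like_i)

def propose_swap(states, log_likes, betas, i, rng):
    """Propose swapping the states of neighboring chains i and i + 1.

    states and log_likes are plain Python lists, one entry per chain.
    Returns True if the swap was accepted.
    """
    j = i + 1
    log_r = min(0.0, swap_log_ratio(log_likes[i], log_likes[j],
                                    betas[i], betas[j]))
    if np.log(rng.uniform()) < log_r:
        states[i], states[j] = states[j], states[i]
        log_likes[i], log_likes[j] = log_likes[j], log_likes[i]
        return True
    return False

# Two chains: beta = 1 (the real posterior) and beta = 0.5 (temperature T = 2).
rng = np.random.default_rng(0)
betas = [1.0, 0.5]
states = [np.array([0.0]), np.array([5.0])]
log_likes = [-10.0, -2.0]  # the hotter chain currently holds the better state

accepted = propose_swap(states, log_likes, betas, 0, rng)
```

In this example the hotter chain holds a state with higher likelihood, so the log-ratio is positive and the swap is always accepted, handing the better state to the \(\beta = 1\) chain.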


[1] Hobson M. P., Jaffe A. H., Liddle A. R., Mukherjee P. & Parkinson D., “Bayesian methods in cosmology”. (Cambridge University Press, 2010).
[2] Geyer C., “Introduction to Markov Chain Monte Carlo”, in “Handbook of Markov Chain Monte Carlo”. http://www.mcmchandbook.net/
[3] Roberts G. O. & Rosenthal J. S., “Optimal scaling for various Metropolis-Hastings algorithms”. Statistical Science 16 (2001) 351-367.
[4] Gregory P. C., “Bayesian logical data analysis for the physical sciences: a comparative approach with Mathematica support”. (Cambridge University Press, 2005).