4.4. Running MCMC

This is the vanilla Markov Chain Monte Carlo with the Metropolis algorithm, as introduced in the previous section, The Metropolis-Hastings sampler.

We will now proceed to run MCMC for the \(\Lambda\text{CDM}\) model. All the following commands must be issued from your terminal at the EPIC directory.

Initializing the chains

Standard MCMC is the default option for sampling. The chains are initialized with:

python initialize_chains.py LCDM.ini

We already know the .ini file, a configuration file containing all the information for that specific simulation. It defines the parameter names and their priors, their TeX strings, and the type of cosmological model, which determines how to calculate distances and the observables; it also sets a label for the model.

The code will ask for the number of chains to be used, which must be at least 2. You can also run the command above with chains=8 as an argument to suppress this prompt. Each chain will run in an independent process with Python's multiprocessing, so you should not choose a number higher than the number of CPUs of your machine, given by multiprocessing.cpu_count().
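
If you are unsure how many CPUs are available, you can query Python's multiprocessing module directly from your terminal:

python -c "import multiprocessing; print(multiprocessing.cpu_count())"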

This command will create a folder with the files that will store the states of the chains, named reducedchain1.txt, reducedchain2.txt, etc, besides other relevant files. The sampling mode (MCMC or PT) is prepended to the name of the folder, which is otherwise the date-time of creation (in UTC), unless the option cfullname=MyFirstRun (c standing for custom) is used, in which case MCMC-MyFirstRun will be the folder name. A custom label can also be prepended with cname=my_label, creating, for example, the folder my_label-MCMC-171110-173000. It will be stored within another folder with the same name as the .ini file, thus in this case LCDM/my_label-MCMC-171110-173000. The folder name will be displayed in the first line of the output.
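
Putting these options together, an initialization that skips the chains prompt and fixes the folder name could be run as:

python initialize_chains.py LCDM.ini chains=8 cfullname=MyFirstRun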

Alternatively, this first command can be skipped by running the main file epic.py directly, as I explain next. The main file can distinguish whether the first argument is a .ini file or a directory. In the first case, it will call initialize_chains.py with the same arguments. In the second case, if the given path corresponds to an MCMC or PT directory (signaled by the mode.txt file) just created with initialize_chains.py, or to any previously initiated simulation, it will continue to evolve its chains from where they stopped.

Sampling the posterior distribution

The magic starts with:

python epic.py <FULL-PATH-TO>/LCDM/MCMC-MyFirstRun/

or simply:

python epic.py LCDM.ini

if you have not initialized the chains yet with the previous command or want to start a new run. Two parameters of the code will be prompted for:

Please specify the number of steps in each passage in the MCMC loop: 10000
Please specify the tolerance for convergence: 1e-2

but these can also be given directly when executing epic.py, in the form of the arguments steps=3000 and tol=1e-2. The number of steps means that the chains will be written to disk (the new states are appended to the chain files) after every steps states in all chains. A large number prevents frequent writing operations, which could otherwise affect overall performance unnecessarily. The relevant information will be displayed, in our example case looking similar to the following:

Loading INI file...
    Reading simplified JLA dataset...
        jla_likelihood_v4
    Reading distance priors from Planck 2015...
        Planck2015_LCDM
    Reading BAO (6dF+SDSS+BOSS+Lyalpha+WiggleZ) data...
        BAO-6dF+SDSS+BOSS+Lyalpha+WiggleZ.txt
    Reading cosmic chronometer H(z) data...
        thirtypointsHz.txt
File <FULL-PATH-TO>/LCDM/MCMC-MyFirstRun/

and the MCMC will start.
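
For reference, a complete non-interactive invocation combining the arguments seen so far might be (recall that epic.py forwards initialization arguments such as chains to initialize_chains.py):

python epic.py LCDM.ini chains=8 steps=10000 tol=1e-2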

In the MCMC mode, the code will periodically check for convergence according to the Gelman-Rubin method (by default this is done every two hours, but a different interval can be specified in the arguments, for example as 12h or 45min). Ideally this does not need to be a small time interval, but the option of specifying this time in minutes or even in seconds (30sec) is implemented and available for testing purposes. The MCMC run stops when convergence is achieved within the tolerance tol.

The following is the output of our example after the MCMC has started. The 10000 steps take a bit less than twelve minutes on my workstation running 8 chains in parallel. The number of chains will not have much impact on this unless we use too many steps per iteration and are close to the limit of memory. After approximately two hours, convergence is checked. Since it is still larger than our required tol, the code continues with new iterations for two more hours before checking convergence again, and so on. When convergence smaller than tol is achieved, the code makes the relevant plots and quits.

Initiating MCMC...
i 1, 10000 steps, 8 ch; 11m48s, Mon Oct 16 19:09:09 2017. Next: ~1h48m.
i 2, 20000 steps, 8 ch; 11m50s, Mon Oct 16 19:21:00 2017. Next: ~1h36m.
i 3, 30000 steps, 8 ch; 11m51s, Mon Oct 16 19:32:51 2017. Next: ~1h24m.
i 4, 40000 steps, 8 ch; 11m50s, Mon Oct 16 19:44:42 2017. Next: ~1h12m.
i 5, 50000 steps, 8 ch; 11m49s, Mon Oct 16 19:56:31 2017. Next: ~1h0m.
i 6, 60000 steps, 8 ch; 11m51s, Mon Oct 16 20:08:23 2017. Next: ~48m58s.
i 7, 70000 steps, 8 ch; 11m50s, Mon Oct 16 20:20:14 2017. Next: ~37m7s.
i 8, 80000 steps, 8 ch; 11m50s, Mon Oct 16 20:32:04 2017. Next: ~25m16s.
i 9, 90000 steps, 8 ch; 11m52s, Mon Oct 16 20:43:57 2017. Next: ~13m24s.
i 10, 100000 steps, 8 ch; 11m53s, Mon Oct 16 20:55:50 2017. Checking now...
Loading chains...                                       [##########]  8/8
n = 100000
Saving information for histograms...                    [##########]  100%
Monitoring convergence...                               [##########]  100%
Total time for data analysis: 14.5 seconds.
After 1h58m and 100000 steps, with 8 chains,
Convergence 3.73e-01 achieved. Current time: Mon Oct 16 20:56:04 2017.
i 11, 110000 steps, 8 ch; 11m49s, Mon Oct 16 21:07:54 2017. Next: ~1h48m.
...

At any time after the first iteration, the user can inspect the acceptance ratio of the chains. This information is updated at every iteration and can be found in the file LCDM/MCMC-MyFirstRun/llc.txt.
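
On a Unix system, for example, you can keep an eye on this file with a standard tool such as tail:

tail LCDM/MCMC-MyFirstRun/llc.txt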

Adaptation (beta)

In development. Come back later to check!

Analyzing the chains

The code reads the chains and compiles the distributions for a nice plot with the command:

python analyze.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/

Histograms are generated for the marginalized distributions using 20 bins, or any other number given with combobins=40, for example, if you think you have sufficient data. At any time, one can run this command with calculate_R to check the state of convergence. By default, it will calculate \(\hat R^p - 1\) at twenty different lengths, up to the full size of the chains, to provide an idea of the evolution of the convergence.
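
For example, a convergence check using a finer binning for the histograms would look like this:

python analyze.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/ calculate_R combobins=40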

Convergence is assessed with the Gelman-Rubin criterion. I will not enter into the details of the method here, but I refer the reader to the original papers [1] [2] for more information. The variant of the original method for multivariate distributions is implemented. When using MCMC, all the chains are checked for convergence, and the final distribution that is analyzed and plotted is the concatenation of all the chains, since they are essentially all sampling the same distribution once they have converged.
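
To fix ideas, here is a minimal sketch of the one-dimensional version of the diagnostic; it is only illustrative (the function name and interface are hypothetical), since EPIC implements the multivariate \(\hat R^p\) variant of [2]:

    import numpy as np

    def gelman_rubin(chains):
        # chains: array of shape (m, n) holding m chains with n states
        # each, for a single parameter (hypothetical interface).
        m, n = chains.shape
        chain_means = chains.mean(axis=1)
        # B/n: variance between the chain means
        B_over_n = chain_means.var(ddof=1)
        # W: average of the within-chain variances
        W = chains.var(axis=1, ddof=1).mean()
        # pooled estimate of the posterior variance
        V_hat = (n - 1) / n * W + (1 + 1 / m) * B_over_n
        # approaches 1 from above as the chains converge
        return V_hat / W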

Additional options

If for some reason you want to view the results at an intermediate point of the simulation, you can tell the script to stop there with stop_at=18000; everything will be analyzed only up to that point.
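
For example:

python analyze.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/ stop_at=18000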

Depending on the model being analyzed, one may be interested in seeing the derived parameters. Once these are defined appropriately in the files derived.py and derived_bf.py, include them in the analysis with the option use_derived.

If you want to check the random walk of the chains, you can plot the sequences with plot_sequences. This will make a grid plot containing all the chains and all parameters. Keep in mind that this can take some time and generate a big output file if the chains are very long. You can get around this problem by thinning the distributions by some factor, with thin=10 for example. This also applies to the calculation of the correlations of the parameters in each chain, enabled with the ACF option.
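
A single call enabling both inspections on thinned chains could look like this:

python analyze.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/ plot_sequences ACF thin=10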

We generally represent distributions by their histograms, but sometimes we may prefer to exhibit smooth curves. Although it is possible to choose a higher number of histogram bins, this may not be sufficient and may require much more data. Much better (although possibly slow when there are too many states) are the kernel density estimates (KDE) tuned for Gaussian-like distributions [3]. To obtain smoothed shapes, use smooth_hist=kde. This will compute KDE curves for the marginalized parameter distributions and also for the two-parameter joint posterior probabilities. If it is taking too long, try it for the 1D distributions only, using kde1 instead. The option kde2, for the 2D distributions only, is also available for completeness.
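
The kind of smoothing involved can be pictured with the generic sketch below, which applies SciPy's gaussian_kde to a made-up sample; note that this is not EPIC's estimator, which is the one of [3], tuned for Gaussian-like distributions:

    import numpy as np
    from scipy.stats import gaussian_kde

    # hypothetical one-dimensional marginalized sample of a parameter
    samples = np.random.normal(loc=0.3, scale=0.02, size=50000)

    # each state contributes a Gaussian kernel; the bandwidth is
    # chosen from the data (Scott's rule by default in SciPy)
    kde = gaussian_kde(samples)
    grid = np.linspace(samples.min(), samples.max(), 200)
    density = kde(grid)  # smooth estimate of the marginalized posterior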

Making the triangle plots

The analyze.py routine will also produce the plots automatically (unless suppressed with dontdraw), but you can always redraw everything whenever you want; maybe you would like to tweak some colors? Loading the chains again is not necessary, since the analysis already saves the information for the plots anyway. So, let's say that we have already saved this information for the chains after 20000 steps. Now we must pass the path to the folder containing the results for that specific chain size and number of bins:

python drawhistograms.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/n20000/results20/

You can choose to view the best-fit point with bf=+ or any other marker. To change the color, use singlecolor=m (for magenta) or any other Python color name. The default is C0 (a shade of blue, from the default Matplotlib palette).

The \(1\sigma\) and \(2\sigma\) confidence levels are shown and are written to the file hist_table.tex inside the results20 folder. This is a \(\LaTeX\)-ready file that can be compiled to make a nice table in a PDF file or included in your paper as you want. To view and save information for more levels, use the argument levels=1,2,3, for example.

If you have used the option smooth_hist with either the kde, kde1 or kde2 option, you need to specify it again in order to make the corresponding plots, otherwise the histograms will be drawn.

Additional options

You can further tweak your plots and tables. usetex will make the program render the plot using \(\LaTeX\), fitting nicely with the rest of your paper. You can even choose the Times New Roman font if you prefer it over the default Computer Modern, using font=Times.

If you have nuisance parameters, you can opt to not show them with nonuisance.

fmt=3 can be used to set the number of significant figures used to report the results (default is 5). You can plot Gaussian fits together with the histograms (the Gaussian curve and the first two sigma levels), using show_gaussian_fit, but probably only for the sake of comparison, since this is not usually done in publications.

All these options can be given to the analyze.py command; they will be passed to drawhistograms.py accordingly.
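
For instance, a single call combining several of the options above might look like this (all options as documented in this section):

python drawhistograms.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/n20000/results20/ \
smooth_hist=kde singlecolor=C9 bf=+ levels=1,2,3 usetex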

Below we see the triangle plot of the histograms with the default settings, in comparison with a perfected version using the smoothed distributions, the Python color C9, the \(\LaTeX\) renderer, the best-fit point and the \(3\sigma\) confidence level. By default, the prior intervals are used as axes limits, with ticks at predefined positions, but both images below override this setting by reading custom intervals and ticks given in the files LCDM.range and LCDM.ticks.

_images/LCDMdefault.png

Triangle plot with default configurations for histograms.

_images/LCDMperfected.png

Customized plot with smoothed distributions.

With use_derived, we can also check out the marginalized distributions of the derived parameters, saved in the der_pars directory:

_images/LCDMderived.png

Distributions of derived parameters.

Combining two or more simulations in one plot

You just need to run drawhistograms.py with two or more paths in the arguments. To illustrate this, I ran a second MCMC simulation for the same model, but this time without including the distance priors data. It is then interesting to plot both realizations together so we can see the effect that including that dataset has on the results:

python drawhistograms.py \
<FULL-OR-RELATIVE-PATH-TO>/LCDM-no-Planck/MCMC-MySecondRun/n32500/results20/ \
<FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/n1300000/results20/ \
colors=C3,k smooth_hist=kde units=Orh2:5,Obh2:2 \
range=h:0.66_0.75,Obh2:0.01_0.056,Och2:0.08_0.16,M:-0.12_0.18 \
ticks=Och2:0.09_0.12_0.15,Obh2:1.8e-2_3.3e-2_4.8e-2,Orh2:4.7e-5_6e-5_7.3e-5 \
usetex font=Times gname=noplanck
_images/noplanck.png

Results for two different analyses.

Notice how crucial the CMB data is to resolve the degeneracy between the baryon and radiation densities when both are free parameters. The constraints are improved considerably.

You can combine as many results as you like. When this is done, any custom ranges or tick positions defined in the .range and .ticks files will be ignored (as well as the factor in the .units file), since different simulations might give considerably different results that could fall outside the ranges of one of them. In this case, after checking the results and determining the ranges that you want to plot, pass this information on the command line as in the example above. The gname option specifies the prefix for the name of the PDF file that will be generated. If you omit this option, you will be prompted to type it. All other settings are optional.

A legend will be included in the top right corner using the labels defined in the labels.txt files: the lines of each file are combined with a plus sign, but you can use whatever label you prefer by replacing the contents of the file, for example giving a name in a single line. The legend title uses the model name of the first simulation in the arguments. This is intended for showing, at the same time, results from different datasets with the same model. If this is not your case and you would like to use a different title, change the name entry in the .ini file of the corresponding simulation accordingly.

Visualizing chain sequences and convergence

If you include the argument plot_sequences, you will get a plot like this:

_images/LCDMsequences.png

Chain sequences along each parameter axis, for all chains.

When monitoring convergence, the values of \(\hat{R}^p - 1\) at twenty different lengths, for the multivariate analysis and for separate one-dimensional analyses of each parameter, are plotted in the files monitor_convergence_<N>.pdf and monitor_each_parameter_<N>.pdf, where N is the total utilized length of the chains. The absolute values of the between-chains and within-chains variances, \(\hat{V}\) and \(W\), are also shown. For our \(\Lambda\text{CDM}\) example, we got

_images/LCDMconvergence.png

Multivariate convergence analysis.

_images/LCDMeachparameter.png

Individual parameter convergence monitoring.

Finally, the correlation of the chains can also be inspected if we use the option ACF. This includes all the cross- and auto-correlations.

_images/LCDMacf.png

Correlations in a chain.

[1] Gelman A. & Rubin D. B. “Inference from Iterative Simulation Using Multiple Sequences”. Statistical Science 7 (1992) 457-472.
[2] Brooks S. P. & Gelman A. “General Methods for Monitoring Convergence of Iterative Simulations”. Journal of Computational and Graphical Statistics 7 (1998) 434.
[3] Kristan M., Leonardis A. & Skocaj D. “Multivariate online kernel density estimation with Gaussian kernels”. Pattern Recognition 44 (2011) 2630-2642.