4.4. Running MCMC¶
This is the vanilla Markov Chain Monte Carlo (MCMC) sampler with the Metropolis algorithm, as introduced in the previous section, The Metropolis-Hastings sampler.
We will now proceed to run MCMC for the \(\Lambda\text{CDM}\) model.
All the following commands must be issued from your terminal at the EPIC
directory.
Initializing the chains¶
Standard MCMC is the default option for sampling. The chains are initialized with:
python initialize_chains.py LCDM.ini
We already know the .ini file, a configuration file containing all the information for that specific simulation. It defines the parameter names and priors, their TeX strings and the type of cosmological model, which will determine how to calculate the distances and the observables, and sets a label for the model.
The code will ask for the number of chains to be used, which must be at least 2. You can also run the command above with chains=8 as an argument to suppress this prompt. Each chain will create an independent process with Python's multiprocessing, so you should not choose a number higher than the number of CPUs of your machine, given by multiprocessing.cpu_count().
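To get a feeling for this setup, here is a minimal sketch of launching one process per chain with multiprocessing; run_chain is a hypothetical stand-in for a chain's Metropolis loop, not EPIC's actual code:

```python
import multiprocessing as mp

def run_chain(chain_id):
    # Hypothetical stand-in for one chain's Metropolis loop;
    # EPIC's actual per-chain work is of course more involved.
    return chain_id

# One independent process per chain (at least 2; at most mp.cpu_count()
# is advised). On platforms that spawn rather than fork (Windows, macOS),
# this should live under an `if __name__ == "__main__":` guard.
n_chains = 2
pool = mp.Pool(processes=n_chains)
results = pool.map(run_chain, range(n_chains))
pool.close()
pool.join()
```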
This command will create a folder with the files that will store the states of the chains, named reducedchain1.txt, reducedchain2.txt, etc., besides other relevant files.
The sampling mode (MCMC or PT) is prepended to the name of the folder, which is always the date-time of creation (in UTC), unless the option cfullname=MyFirstRun (c standing for custom) is used, in which case MCMC-MyFirstRun will be the folder name.
A custom label can also be prepended with cname=my_label, creating, for example, the folder my_label-MCMC-171110-173000. It will be stored within another folder with the same name as the .ini file, thus in this case LCDM/my_label-MCMC-171110-173000.
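Putting the naming rules together, a minimal sketch (illustrative only, not EPIC's actual code) could look like:

```python
from datetime import datetime, timezone

def folder_name(mode, cname=None, cfullname=None):
    """Illustrative reconstruction of the naming scheme described above."""
    if cfullname:                      # cfullname=MyFirstRun -> MCMC-MyFirstRun
        return "%s-%s" % (mode, cfullname)
    # default name: mode plus UTC date-time of creation, e.g. MCMC-171110-173000
    stamp = datetime.now(timezone.utc).strftime("%y%m%d-%H%M%S")
    name = "%s-%s" % (mode, stamp)
    if cname:                          # cname=my_label -> my_label-MCMC-...
        name = "%s-%s" % (cname, name)
    return name
```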
The folder name will be displayed in the first line of the output.
Alternatively, this first command can be skipped by running the main file epic.py directly, as I explain next. The main file can distinguish whether the first argument is a .ini file or a directory. In the first case, it will call initialize_chains.py with the same arguments.
In the second case, if the path given corresponds to an MCMC or PT directory (signaled by the mode.txt file) just created with initialize_chains.py, or to any previously initiated simulation, it will continue to evolve its chains from where they stopped.
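The dispatch logic just described can be sketched like this; only the mode.txt convention comes from the text, the function itself is hypothetical:

```python
import os

def dispatch(path):
    """Decide whether to start a new run or resume an existing one."""
    if path.endswith(".ini"):
        return "initialize"            # same effect as initialize_chains.py
    mode_file = os.path.join(path, "mode.txt")
    if os.path.isdir(path) and os.path.isfile(mode_file):
        with open(mode_file) as f:
            mode = f.read().strip()    # "MCMC" or "PT"
        return "resume-" + mode        # continue evolving the chains
    raise ValueError("not a .ini file nor a simulation directory: %s" % path)
```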
Sampling the posterior distribution¶
The magic starts with:
python epic.py <FULL-PATH-TO>/LCDM/MCMC-MyFirstRun/
or simply:
python epic.py LCDM.ini
if you have not initialized the chains yet with the previous command or want to start a new run. You will be prompted for two parameters of the code:
Please specify the number of steps in each passage in the MCMC loop: 10000
Please specify the tolerance for convergence: 1e-2
but these can also be informed directly as arguments when executing epic.py, in the form steps=3000 and tol=1e-2.
This number of steps means that the chains will be written to disk (the new states are appended to the chain files) after every steps states in all chains. A large number prevents frequent writing operations, which could otherwise affect overall performance unnecessarily.
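The batching behaviour can be sketched as follows; sampler_step and write_states are hypothetical stand-ins for one Metropolis update and a file-append operation:

```python
def run_batched(sampler_step, write_states, n_iterations, steps):
    """Accumulate `steps` new states in memory, then flush them to disk once."""
    buffer = []
    for i in range(n_iterations * steps):
        buffer.append(sampler_step(i))
        if len(buffer) == steps:   # flush only after every `steps` states
            write_states(buffer)
            buffer = []

# Toy usage: "sampling" the integers, "writing" into a list of batches.
written = []
run_batched(lambda i: i, written.append, 2, 3)
```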
The relevant information will be displayed, in our example case looking similar to the following:
Loading INI file...
Reading simplified JLA dataset...
jla_likelihood_v4
Reading distance priors from Planck 2015...
Planck2015_LCDM
Reading BAO (6dF+SDSS+BOSS+Lyalpha+WiggleZ) data...
BAO-6dF+SDSS+BOSS+Lyalpha+WiggleZ.txt
Reading cosmic chronometer H(z) data...
thirtypointsHz.txt
File <FULL-PATH-TO>/LCDM/MCMC-MyFirstRun/
and the MCMC will start.
In the MCMC mode, the code will periodically check for convergence according to the Gelman-Rubin method (by default this is done every two hours, but a different interval can be specified with arguments such as 12h or 45min). Ideally this does not need to be a small time interval, but the option of specifying this time in minutes or even in seconds (30sec) is implemented and available for testing purposes.
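A sketch of how such interval strings could be parsed (the suffixes come from the examples above; the parsing code itself is an illustrative guess, not EPIC's actual implementation):

```python
import re

def parse_interval(text):
    """Convert strings like '12h', '45min' or '30sec' to seconds."""
    match = re.fullmatch(r"(\d+)(h|min|sec)", text)
    if match is None:
        raise ValueError("expected e.g. '12h', '45min' or '30sec'")
    value, unit = int(match.group(1)), match.group(2)
    return value * {"h": 3600, "min": 60, "sec": 1}[unit]
```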
The MCMC run stops if convergence is achieved with a tolerance smaller than tol.
The following is the output of our example after the MCMC has started. The 10000 steps take a bit less than twelve minutes on my workstation running 8 chains in parallel. The number of chains will not have much impact on this unless we use too many steps per iteration and are close to the limit of memory. After approximately two hours, convergence is checked. Since it is still larger than our required tol, the code continues with new iterations for two more hours before checking convergence again, and so on. When convergence smaller than tol is achieved, the code makes the relevant plots and quits.
Initiating MCMC...
i 1, 10000 steps, 8 ch; 11m48s, Mon Oct 16 19:09:09 2017. Next: ~1h48m.
i 2, 20000 steps, 8 ch; 11m50s, Mon Oct 16 19:21:00 2017. Next: ~1h36m.
i 3, 30000 steps, 8 ch; 11m51s, Mon Oct 16 19:32:51 2017. Next: ~1h24m.
i 4, 40000 steps, 8 ch; 11m50s, Mon Oct 16 19:44:42 2017. Next: ~1h12m.
i 5, 50000 steps, 8 ch; 11m49s, Mon Oct 16 19:56:31 2017. Next: ~1h0m.
i 6, 60000 steps, 8 ch; 11m51s, Mon Oct 16 20:08:23 2017. Next: ~48m58s.
i 7, 70000 steps, 8 ch; 11m50s, Mon Oct 16 20:20:14 2017. Next: ~37m7s.
i 8, 80000 steps, 8 ch; 11m50s, Mon Oct 16 20:32:04 2017. Next: ~25m16s.
i 9, 90000 steps, 8 ch; 11m52s, Mon Oct 16 20:43:57 2017. Next: ~13m24s.
i 10, 100000 steps, 8 ch; 11m53s, Mon Oct 16 20:55:50 2017. Checking now...
Loading chains... [##########] 8/8
n = 100000
Saving information for histograms... [##########] 100%
Monitoring convergence... [##########] 100%
Total time for data analysis: 14.5 seconds.
After 1h58m and 100000 steps, with 8 chains,
Convergence 3.73e-01 achieved. Current time: Mon Oct 16 20:56:04 2017.
i 11, 110000 steps, 8 ch; 11m49s, Mon Oct 16 21:07:54 2017. Next: ~1h48m.
...
At any time after the first iteration, the user can inspect the acceptance ratio of the chains. This information is updated at every iteration and can be found in the file LCDM/MCMC-MyFirstRun/llc.txt.
Adaptation (beta)¶
In development. Come back later to check!
Analyzing the chains¶
The code reads the chains and compiles the distributions for a nice plot with the command:
python analyze.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/
Histograms are generated for the marginalized distributions using 20 bins, or any other number given with combobins=40, for example, if you think you have sufficient data.
At any time one can run this command with calculate_R to check the state of convergence. By default, it will calculate \(\hat R^p - 1\) at twenty different lengths along the chains, to provide an idea of the evolution of the convergence.
Convergence is assessed based on the Gelman-Rubin criterion. I will not go into the details of the method here, but I refer the reader to the original papers [1] [2] for more information. The variation of the original method for multivariate distributions is implemented.
When using MCMC, all the chains are checked for convergence, and the final distribution that is analyzed and plotted is the concatenation of all the chains, since they are essentially all the same once they have converged.
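For intuition, here is a minimal sketch of the one-dimensional Gelman-Rubin statistic; EPIC itself implements the multivariate generalization of Brooks & Gelman [2]:

```python
import numpy as np

def gelman_rubin(chains):
    """Univariate Gelman-Rubin statistic for an (m, n) array of m chains."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    V_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return V_hat / W                         # -> 1 as the chains converge

# Chains drawn from the same distribution should give R close to 1.
rng = np.random.default_rng(0)
R = gelman_rubin(rng.normal(size=(4, 5000)))
```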
Additional options¶
If for some reason you want to view the results at an intermediate point of the simulation, you can tell the script to stop there with stop_at=18000; everything will be analyzed up to that point.
Depending on the model being analyzed, one may be interested in seeing the derived parameters. Once these are defined appropriately in the files derived.py and derived_bf.py, include them in the analysis with the option use_derived.
If you want to check the random walk of the chains, you can plot the sequences with plot_sequences. This will make a grid plot containing all the chains and all parameters. Keep in mind that this can take some time and generate a big output file if the chains are very long. You can work around this problem by thinning the distributions by some factor, for example thin=10.
This also applies to the calculation of the correlation of the parameters in each chain, enabled with the ACF option.
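Thinning itself is just regular subsampling of the stored states; a minimal sketch with NumPy:

```python
import numpy as np

def thin(chain, factor):
    """Keep every `factor`-th state of a chain (rows are states)."""
    return chain[::factor]

# Toy chain: 100 states of 2 parameters, thinned by a factor of 10.
chain = np.arange(200).reshape(100, 2)
thinned = thin(chain, 10)
```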
We generally represent distributions by their histograms, but sometimes we may prefer to exhibit smooth curves. Although it is possible to choose a higher number of histogram bins, this may not be sufficient and may require much more data. Much better (although possibly slow when there are too many states) are the kernel density estimates (KDE) tuned for Gaussian-like distributions [3].
To obtain smoothed shapes, use smooth_hist=kde. This will compute KDE curves for the marginalized parameter distributions and also for the two-parameter joint posterior probabilities. If it is taking too long, try it for the 1D distributions only, using kde1 instead. The option kde2, for the 2D distributions only, is also available for completeness.
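EPIC's KDE follows the adaptive method of [3]; as a rough stand-in, a fixed-bandwidth Gaussian KDE with Silverman's rule already illustrates the idea of smoothing a marginalized distribution instead of histogramming it:

```python
import numpy as np

def kde_1d(samples, grid):
    """Fixed-bandwidth Gaussian KDE evaluated on `grid` (illustrative only)."""
    n = len(samples)
    bw = 1.06 * samples.std() * n ** (-1 / 5)      # Silverman's rule of thumb
    diffs = (grid[:, None] - samples[None, :]) / bw
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bw)

rng = np.random.default_rng(1)
samples = rng.normal(0.3, 0.05, size=5000)   # toy marginalized chain
grid = np.linspace(0.1, 0.5, 201)
density = kde_1d(samples, grid)
peak = grid[np.argmax(density)]              # should sit near the true mean
```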
Making the triangle plots¶
The analyze.py routine will also produce the plots automatically (unless suppressed with dontdraw), but you can always redraw everything later, for example if you would like to tweak some colors. Loading the chains again is not necessary, since the analysis already saves the information for the plots.
So, let’s say that we have already saved this information for the chains after 20000 steps. Now we must pass the path to the folder containing the results for that specific chain size and number of bins:
python drawhistograms.py <FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/n20000/results20/
You can choose to view the best-fit point with bf=+ or any other marker. To change the color, use singlecolor=m (for magenta) or any other Python color name. The default is C0 (a shade of blue, from the default Matplotlib palette).
The \(1\sigma\) and \(2\sigma\) confidence levels are shown and are written to the file hist_table.tex inside the results20 folder. This is a \(\LaTeX\)-ready file that can be compiled to make a nice table in a PDF file or included in your paper as you want. To view and save information for more levels, use the argument levels=1,2,3, for example.
If you have used the option smooth_hist with either kde, kde1 or kde2, you need to specify it again in order to make the corresponding plots; otherwise the histograms will be drawn.
Additional options¶
You can further tweak your plots and tables. usetex will make the program render the plot using \(\LaTeX\), fitting nicely with the rest of your paper. You can even choose the Times New Roman font, if you prefer it over the default Computer Modern, using font=Times.
If you have nuisance parameters, you can opt not to show them with nonuisance. fmt=3 can be used to set the number of significant figures used to report the results (the default is 5).
You can plot Gaussian fits together with the histograms (the Gaussian curve and the first two sigma levels) using show_gaussian_fit, but probably only for the sake of comparison, since this is not usually done in publications.
All these options can be given to the analyze.py command; they will be passed to drawhistograms.py accordingly.
Below we see the triangle plot of the histograms with the default settings, in comparison with a polished version using the smoothed distributions, the Python color C9, the \(\LaTeX\) renderer, and including the best-fit point and the \(3\sigma\) confidence level. By default, the prior intervals are used as axis limits, with ticks at predefined positions, but both images below override this setting, reading custom intervals and ticks given in the files LCDM.range and LCDM.ticks.
With use_derived, we can also check the marginalized distributions of the derived parameters in the der_pars directory.
Combining two or more simulations in one plot¶
You just need to run drawhistograms.py
with two or more paths in the
arguments. To illustrate this, I run a second MCMC simulation for the same
model but this time without including the distance priors data. It is then
interesting to plot both realizations together so we can seethe effect that
including that dataset has on the results:
python drawhistograms.py \
<FULL-OR-RELATIVE-PATH-TO>/LCDM-no-Planck/MCMC-MySecondRun/n32500/results20/ \
<FULL-OR-RELATIVE-PATH-TO>/LCDM/MCMC-MyFirstRun/n1300000/results20/ \
colors=C3,k smooth_hist=kde units=Orh2:5,Obh2:2 \
range=h:0.66_0.75,Obh2:0.01_0.056,Och2:0.08_0.16,M:-0.12_0.18 \
ticks=Och2:0.09_0.12_0.15,Obh2:1.8e-2_3.3e-2_4.8e-2,Orh2:4.7e-5_6e-5_7.3e-5 \
usetex font=Times gname=noplanck
Notice how crucial the CMB data is to resolve the degeneracy between the baryon and radiation densities when both are free parameters. The constraints are improved considerably.
You can combine as many results as you like.
When this is done, any custom ranges or tick positions defined in .range and .ticks files will be ignored (as well as the factor in the .units file), since different simulations might give considerably different results that could fall outside the ranges of one of them. In this case, after checking the results and determining the ranges that you want to plot, pass this information in the command line as in the example above.
The gname option specifies the prefix for the name of the PDF file that will be generated. If you omit this option, you will be prompted to type it. All other settings are optional.
A legend will be included in the top right corner, using the labels defined in the labels.txt files: the lines of these files are combined with a plus sign, but you can use whatever label you prefer by replacing the contents of this file, for example giving a name in a single line.
The legend title uses the model name of the first simulation in the arguments.
This is intended for showing, at the same time, results from different datasets with the same model. If this is not your case and you would like to use a different title, change the name entry in the .ini file of the corresponding simulation accordingly.
Visualizing chain sequences and convergence¶
If you include the argument plot_sequences
you will get a plot like this
When monitoring convergence, the values of \(\hat{R}^p - 1\) at twenty different lengths, for the multivariate analysis and for separate one-dimensional analyses of each parameter, are plotted in the files monitor_convergence_<N>.pdf and monitor_each_parameter_<N>.pdf, where N is the total utilized length of the chains. The absolute values of the between-chains and within-chains variances, \(\hat{V}\) and \(W\), are also shown.
For our \(\Lambda\text{CDM}\) example, we got
Finally, the correlation of the chains can also be inspected with the option ACF. This includes all the cross- and auto-correlations.
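The autocorrelation function of a single chain can be sketched as follows; for a well-mixed chain it should decay quickly with lag (this simple estimator is illustrative, not EPIC's actual code):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Normalized autocorrelation of a 1D chain, for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = x.var()
    return np.array([1.0] + [(x[:-k] * x[k:]).mean() / var
                             for k in range(1, max_lag + 1)])

# White noise decorrelates immediately: ACF near 0 for every nonzero lag.
rng = np.random.default_rng(2)
acf = autocorrelation(rng.normal(size=100000), max_lag=5)
```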
[1] Gelman A. & Rubin D. B., “Inference from Iterative Simulation Using Multiple Sequences”. Statistical Science 7 (1992) 457-472.
[2] Brooks S. P. & Gelman A., “General Methods for Monitoring Convergence of Iterative Simulations”. Journal of Computational and Graphical Statistics 7 (1998) 434.
[3] Kristan M., Leonardis A. & Skocaj D., “Multivariate online kernel density estimation with Gaussian kernels”. Pattern Recognition 44 (2011) 2630-2642.