This program suite allows the user to compress Molecular Dynamics (MD) trajectories using Principal Component Analysis (PCA) algorithms. This technique offers a good compression ratio at the expense of losing some precision in the trajectory.
People working with Molecular Dynamics simulations know that the output files of their work are huge, but they need to save their work for later study, and the available space disappears quickly. Traditional general-purpose compression algorithms like LZW have been used to reduce the required space, but they usually do not perform well with this kind of data. We know, however, that this data is not random: it follows a pattern and has a well-defined meaning, which we can exploit in our effort to reduce its size. We also know that, most of the time, we do not need to store all the details. Higher-frequency movements can be due to temperature or other factors that are irrelevant for some kinds of analysis.
This knowledge points us towards the use of PCA techniques. These methods try to change the coordinate space of the system being analyzed so as to capture the maximum amount of variance with the minimum effort. They also allow us to select the degree of fidelity to the original trajectory that we need, so if we need a very accurate trajectory, we can compress the data less in order to keep more detail.
Principal Component Analysis is a technique used to reduce the dimensionality of a dataset. It is an orthogonal linear transformation that transforms the data into a new coordinate system such that the greatest variance by any projection of the data lies on the first coordinate, the second greatest variance on the second, and so on. When we keep the first coordinates, we are keeping the movements that contribute most to the variance of the dataset. The last coordinates can then be ignored because their contribution is very low, so we store less information but the analysis can be performed in the same way.
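In matrix terms (standard PCA notation, not notation taken from the suite itself), the transformation can be written as

$$y_f = W^{T}\,(x_f - \bar{x}),$$

where $x_f$ is the vector of coordinates of frame $f$, $\bar{x}$ is the mean structure, and the columns of $W$ are orthonormal directions ordered by decreasing variance. Truncating $y_f$ to its first few components discards only the low-variance directions.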
There are different ways to apply this technique. I will explain the method used by this suite.
Let's suppose we have an MD trajectory of N atoms and F frames. The first action is to prepare the input for the real processing and compression. We must recenter the different snapshots onto a representative structure in order to achieve a good compression. This action is performed in three steps:
The recentering is performed by looking for the best RMSd fit. This value can be computed using two different algorithms:
The Gaussian algorithm may help to reduce the number of eigenvectors needed for a given compression, thus reducing the size of the compressed file. It also allows for other analyses, like hinge point prediction, that are much more difficult and imprecise with a standard RMSd algorithm.
The first step is to compute the covariance matrix of the trajectory, where the random variables are the coordinates of the N atoms. This leads to a symmetric square matrix of dimensions 3N×3N. Once we have this matrix, we diagonalize it in order to get the associated eigenvalues and eigenvectors. It turns out that the sum of all the eigenvalues is the total variance of the trajectory, and each individual eigenvalue is the amount of variance explained by the corresponding eigenvector.
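In formulas (standard notation, not taken from the suite's sources), with frames $x_1,\dots,x_F \in \mathbb{R}^{3N}$ and mean structure $\bar{x}$:

$$C = \frac{1}{F-1}\sum_{f=1}^{F}(x_f-\bar{x})(x_f-\bar{x})^{T}, \qquad C\,v_i = \lambda_i\,v_i, \qquad \sum_{i=1}^{3N}\lambda_i = \operatorname{tr}(C) = \text{total variance}.$$

(The normalization by $F-1$ rather than $F$ is a convention; the suite's exact choice is not stated here.)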
With this data we can select and save enough eigenvectors to preserve an arbitrary amount of variance and discard the ones that give us little to no information.
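A minimal sketch of this selection step (illustrative only; the function name and the assumption that the eigenvalues arrive sorted in descending order are ours, not the suite's):

    /* Keep the smallest number of eigenvectors whose eigenvalues cover
     * at least `quality` (e.g. 0.90) of the total variance.
     * Assumes eval[] is sorted in descending order. */
    int select_eigenvectors(const double *eval, int n, double quality) {
        double total = 0.0, partial = 0.0;
        for (int i = 0; i < n; i++)
            total += eval[i];
        int kept = 0;
        while (kept < n && partial / total < quality)
            partial += eval[kept++];
        return kept;
    }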
Once the eigenvectors have been selected, a projection of the original data into the new coordinate space must be performed. The final output of the algorithm must contain the mean structure, the eigenvectors and the projections, which is everything needed to restore the original file. Our suite also stores some other information, like the eigenvalues and the atom names, to allow us to perform analyses and manipulations in a quicker and more flexible way.
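The projection and the later reconstruction are both simple dot products; the sketch below (array layouts and names are assumptions, not the suite's actual code) shows the two halves of the round trip:

    /* Project one frame onto the kept eigenvectors.
     * evec[i*n3 + d] is component d of eigenvector i; n3 = 3N. */
    void project(const double *x, const double *mean, const double *evec,
                 int n3, int kept, double *proj) {
        for (int i = 0; i < kept; i++) {
            proj[i] = 0.0;
            for (int d = 0; d < n3; d++)
                proj[i] += evec[i * n3 + d] * (x[d] - mean[d]);
        }
    }

    /* Rebuild the frame: mean structure plus the weighted eigenvectors.
     * This is the decompression step; accuracy depends on `kept`. */
    void rebuild(const double *proj, const double *mean, const double *evec,
                 int n3, int kept, double *x) {
        for (int d = 0; d < n3; d++) {
            x[d] = mean[d];
            for (int i = 0; i < kept; i++)
                x[d] += proj[i] * evec[i * n3 + d];
        }
    }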
The format used by the compressed files has changed over the evolution of this code. This suite works natively with the PCZ4 format, but it also supports reading files written in the PCZ2 and PCZ3 formats. All the formats are binary, with PCZ2 and PCZ3 being very similar and PCZ4 adding new features.
The most important difference between PCZ2/3 and PCZ4 is the lack of atom names in the former, which makes it impossible to use masks to select atoms.
The code is quite standard, so no difficult steps are needed. First, you need to download the tarball, pcasuite.tar.gz. Once we have the tarball, we must uncompress it, enter the newly created pcazip folder and compile the source with make.
Prior to executing the compilation we must choose a compiler and adjust the Makefile with the proper flags. This is done through the config.mk file. This file is a soft link to a file with the proper flags. Files for different compilers are provided:

config.gcc3 for the GCC v3 compiler series
config.gcc4 for the GCC v4 compiler series
config.intel for the Intel compilers
config.xlc for the IBM XL compilers

There is a default link to compile with GCC v4. In summary, the steps needed to compile the application are:
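Something like the following sequence (a sketch; the link step is only needed if you are not using the default GCC v4 configuration):

    $ tar xzf pcasuite.tar.gz
    $ cd pcazip
    $ ln -sf config.gcc4 config.mk    # or config.gcc3 / config.intel / config.xlc
    $ make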
This procedure compiles the source code and leaves the binaries in the same folder. These binaries can then be moved to a proper place so they can be executed easily.
Some adjustments may be required to get the code to compile if the required libraries are not stored in standard paths. A local config.mk file can be generated to adapt the compilation to the computer where the suite is being built.
This tool is the main compression engine. It performs the steps outlined in the Operation section: it reads the trajectory, recenters the frames, computes the covariance matrix, computes the eigenvectors and the projections of the trajectory onto the new coordinate space, and finally writes the compressed file.
Its operation mode is quite simple, needing just the input trajectory, a PDB file to get the atom names from, and some CPU time. The syntax is as follows:

$ pcazip -i <input_trajectory> -p <input_PDB_file> -o <output_file> [options]
Complete list of supported options:
Examples of use:
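For instance, a basic compression run would look like this (the file names are placeholders, not files shipped with the suite):

    $ pcazip -i trajectory.x -p structure.pdb -o trajectory.pcz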
This tool serves the purpose of reconstructing the trajectory from the compressed data. It works by retrieving the eigenvectors and the associated projections and operating with them to rebuild the original data. This tool is quite simple and easy to use, and mimics the parameters of the other tools in the suite. This is the complete list of supported options:
Examples of use:
This is the tool used when we need to analyze and query the data stored inside the compressed file. It allows the user to query for the values stored directly in the file and also to compute other values based on the stored ones.
The information that can be retrieved with this tool is:
The supported options are:
Examples of use:
Mask strings have been implemented to allow users to specify the atoms that must be used in a compression or analysis easily, concisely and without having to generate additional files. They have a syntax very similar to that used by the Amber utilities, but slightly different to allow for greater flexibility.
The masks are composed of atom and residue specifications and the connectors between them. Atoms are preceded by @ and residues by :. After one of those symbols comes the atom/residue specification. It can come in numerical form, giving the atom or residue number, or in alphabetical form, giving the atom or residue name. Multiple specifications can be separated by commas, and when using numeric specifications, ranges can be created by using dashes. Some examples:
@C represents all the Carbon atoms
:GLY represents all the glycine residues
:1-10 represents the first 10 residues
@C,CA,N,O represents all the atoms belonging to the backbone
@2-5,7,9,15-30 represents all the atoms numbered from 2 to 5, 7, 9 and from 15 to 30
All of these specifications can be combined by means of logical constructions and parentheses. Logical AND (&), logical OR (|) and logical NOT (~) can be used and combined. With these constructions we can ask for residue 3 and atom 300, but not atom 150, with this line:

:3&@300&~@150
There is also the possibility of using wildcards to complete the names of atoms and residues, so we can select all the Hydrogens with @H*.
More examples follow:

:10-20&~@CA,C,N,O represents the sidechains of residues 10 through 20
:GLY|(@O*,N*&~:GLU) represents all the GLY residues along with all the Oxygen and Nitrogen atoms that do not belong to a GLU residue
@O*&~:GLY represents all the Oxygen atoms that do not belong to a Glycine

PCAzip is the most time-consuming application of the suite. The most expensive routines of the algorithm are the covariance matrix generation and the eigenvector calculation. These routines have been targeted for parallelization with MPI, the standard paradigm for distributed-memory parallelization of code.
The parallelization of the eigenvector computation has been achieved through the use of the ScaLAPACK library, a highly optimized and portable library written in Fortran. Most hardware vendors offer their own versions optimized for their own machines. ScaLAPACK is parallelized through MPI for distributed-memory machines, which conditioned the decision to use MPI for the parallelization of the covariance matrix calculation as well.
The parallelization of the covariance matrix computation has been done with MPI not only because the ScaLAPACK library uses this methodology, but also because it is a more scalable technology than shared-memory approaches like OpenMP. Using shared memory we are limited to the processors physically present on the machine; no connection to other networked processors can be made. Using a distributed-memory approach this limitation is overcome and remote processors can be used, for example those of a computer cluster. The parallelization effort is greater in the message-passing paradigm, but it is compensated by its scalability.
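As an illustration of how the covariance step fits this paradigm (a minimal sketch assuming frames are distributed over the ranks and already centered on the mean structure; none of the names below come from the suite's sources):

    #include <mpi.h>

    /* Each rank accumulates partial covariance sums over its own `myF`
     * frames, then a single reduction combines the partial matrices.
     * cov[] has n3*n3 entries and must arrive zero-initialized;
     * normalization by the total frame count is left to the caller. */
    void mpi_covariance(const double *x, int myF, int n3, double *cov) {
        for (int f = 0; f < myF; f++)
            for (int i = 0; i < n3; i++)
                for (int j = 0; j < n3; j++)
                    cov[i * n3 + j] += x[f * n3 + i] * x[f * n3 + j];
        MPI_Allreduce(MPI_IN_PLACE, cov, n3 * n3, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }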
After benchmarking the resulting code, we can see that the covariance matrix calculation scales really well, although the eigenvector calculation shows poorer performance. This means that we must choose the number of processors carefully in order not to waste resources. If the covariance matrix calculation is expected to be heavier than the diagonalization, then more processors can be added without fear. But if the diagonalization is the most time-consuming step, then we must adjust the number of processors to keep this step efficient.
The source code of the PCAzip utility has been prepared to support multiple input file formats. This has been achieved through a file recognition process. It is very easy to add a new file format to the compressor, modifying only one file and adding another.
The core of the file recognition engine is the set of methods in the traj_io.c file. When a file is opened with the trajopen method, the different input modules are tried in order until we find a module that accepts that kind of file.
The methods used to identify the files are stored in function pointer arrays:
trjopen
trjsnap
trjFormatOK
trjclose
Each array contains a pointer to the methods used to open, read, identify or close a format. This means that you must provide a new C source file with the code needed to open a file in the format you are interested in, the code needed to read a snapshot from a previously opened file, the code needed to identify a file in that format, and the code needed to close it.
Examples of these methods can be found in the binpos_io.c file, which contains the code needed to read binpos binary files.
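Purely as an illustration (the function names, signatures and the magic-number check below are assumptions, not the actual prototypes expected by traj_io.c), a new format module would look roughly like this:

    #include <stdio.h>
    #include <string.h>

    /* Return nonzero if the file looks like our hypothetical format. */
    int myfmtFormatOK(FILE *f) {
        char magic[4];
        rewind(f);
        if (fread(magic, 1, 4, f) != 4)
            return 0;
        return memcmp(magic, "MYFT", 4) == 0;
    }

    /* Skip the header so snapshots can be read sequentially. */
    int myfmtOpen(FILE *f) {
        return fseek(f, 4, SEEK_SET);
    }

    /* Read the next snapshot: 3*natoms single-precision coordinates. */
    int myfmtSnap(FILE *f, int natoms, float *xyz) {
        size_t n = (size_t)3 * natoms;
        return fread(xyz, sizeof(float), n, f) == n ? 0 : -1;
    }

    /* Nothing format-specific to release in this sketch. */
    int myfmtClose(FILE *f) {
        (void)f;
        return 0;
    }

The addresses of these four functions would then be added to the trjopen, trjsnap, trjFormatOK and trjclose arrays so that the recognition loop can try the new format.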