A tool to compress Molecular Dynamics trajectories using Principal Components Analysis
This program suite allows the user to compress Molecular Dynamics (MD) trajectories using Principal Component Analysis (PCA) algorithms. This technique offers a good compression ratio at the expense of losing some precision in the trajectory.
One of the main shortcomings to popularize the use of Molecular Dynamics are its potentially large trajectory files. State-of-the-art simulation in the high nanosecond time scale can span easily several Gb, especially for large systems. Traditional general compression algorithms like LZW have been used in order to reduce the required space, but they usually do not perform well with this kind of data. However, trajectory data is not random. It follows patterns with well defined meaning that can be exploted for data compression. In particular, higher frequency movements can be discarded without affecting the overall dynamics of the system. Principal Component Analysis is one of the most used techniques to capture the essential movements of a macromolecular system. It implies a change in the coordinate space where reference eigenvectors are chosen according to the amount of system variance explained. The aim is to select the minimum number of reference coordinates that explain a given amount of system variance. The technique allows to select the degree of fidelity to the original trajectory. Chosing all eigenvectors there is no change in the accurancy of the trajectory. However, removing eigenvectors with the lowest amount of explained variance, has little effect on the overall behavior of the system, but has a remarkable effect on the size of the data.
Let's suppose we have a MD trajectory of N atoms and F frames. The first action is to prepare the input for the real processing and compression. Trajectory must be superimposed onto a representative structure. This would minimize the oscilations around the average structure and, hence, minimize the number of required eigenvectors. This action is performed in three steps:
The superimposition is performed looking for the best RMSd fit. This value can be computed using two different algorithms:
The gaussian RMSd algorithm may help to reduce the number of eigenvectors needed for a given compression, thus reducing the size of the compressed file. The gaussian RMSd algorithm also allows for other analysis, like the hinge point prediction, much more difficult and imprecise using a standard RMSd algorithm.
The first step is to compute the covariance matrix of the trajectory, where the random variables are the coordinates of the N atoms. This leads to a symmetric square matrix of 3N*3N dimensions. The matrix is then diagonalized to get the associated eigenvalues and eigenvectors. The sum of all the eigenvalues is the total variance of the trajectory, and the individual eigenvalues are the variance quantity explained by the corresponding eigenvector. Using this data the number eigenvectors that explain the desired percentage of the total variance can be selected (NV). A higher number of eigenvector implies a more accurate representation of the original trajectory, but leads to a lower compression rate.
Once the eigenvectors have been selected, coordinated of the original F frames are projected onto the new coordinate space. The final output of the algorithm contains the average structure, the eigenvectors and the calculated projections. The size of stored coordinates is reduced from F*3N to F*3NV. PCA suite also stores other information, like the eigenvalues and the atom names to allow to perform the analysis and manipulations in a quick and more flexible way.
Through the evolution of this code, the format used by the compressed files has changed along time. PCA suite works natively with the PCZ4 format, but it also supports files stored in PCZ2 and PCZ3 formats. All the formats are binary-based, being the PCZ2 and PCZ3 very similar and PCZ4 adding new features.
The most important improvement in PCZ4 is the inclusion of the atom names that makes possible to use masks to perform the actions on set of selected atoms.
PCAsuite is portable and can be compiled for any usual architecture. We also provide statically precompiled binaries for the most popular systems and architectures.
These binaries have been statically compiled against gFORTRAN, NetCDF and LAPACK. Redistribution of the binaries is subject to the availability of the source code and the reproduction of their LICENSE files.
System requirements:
gcc
In the first place, you need to download the tarball from pcasuite.tar.gz and extract it.
Prior to executing the compilation choose a compiler and adjust the Makefile with the proper flags. This is done through the config.mk
file. This file is a soft link to a file with the proper flags. Some files adapted to different compilers are provided:
config.gcc3
for the GCC v3 compiler seriesconfig.gcc4
for the GCC v4 compiler seriesconfig.intel
for the Intel compilersconfig.xlc
for the IBM xl compilersBy default, config.mk
is linked with GCC v4. Some adjustments may be required in order to have the code compiled if the libraries required are not stored in standard paths.
In summary, the steps needed to compile the application are:
config.mk
as needed. For example, you might want to edit the CFLAGS
to add new include or library directories (flags -I
and -L
)This procedure compiles the source code and leave the binaries in the same folder. This binaries can be moved to a proper place to be executed easily.
This tool is the main compression engine. It performs the steps outlined in the Operation section. This tool reads the trajectory, recenters the frames, computes the covariance matrix, computes the eigenvectors, the projections for the trajectory onto the new coordinate space and finally writes the compressed file.
Input files are the trajectory to be compressed, and a PDB fiile matching the simulated structure. The required syntax is:
$ pcazip -i <input_trajectory> -p <input_PDB_file> -o <output_file> [options]
Complete list of supported options:
Examples of use:
This tool serves the purpose of reconstructing the trajectory from the compressed data. It works by retrieving the eigenvectors, the associate projections, and operating with them until we recover the original trajectory. This is the complete list of supported options:
Examples of use:
This is the tool used to analyze and query the data stored inside the compressed file. It allows to query for the values stored directly in the file and also to compute other values based on the stored ones.
The information that can be retrieved with this tool are:
The supported options are:
Examples of use:
Mask strings have been implemented to allow the users to specify the atoms that must be used in a compression or analysis with ease, concisely and without having to generate more files. They have a syntax very similar to that used by Amber utilities but slightly different to allow for greater flexibility.
The masks are composed of atom and residue specifications and the connectors between them. Atoms are preceded by @ and residues by :. After one of those symbols comes the atom/residue specification. It can come in numerical form, giving the atom or residue number, or in alphabetical form, giving the atom or residue name. Multiple specifications can be separated by commas, and when using numeric specifications, ranges can be created by using dashes. Some examples:
@C
represents all the Carbon atoms:GLY
represents all the glycine residues:1-10
represents the first 10 residues@C,CA,N,O
represents all the atoms belonging to the backbone@2-5,7,9,15-30
represents all the atoms numbered from 2 to 5, 7, 9 and from 15 to 30All of this specifications can be combined by means of logical constructions and parenthesis. Logical AND (&
), logical OR (|
) and logical NOT (~
) can be used and combined. With this constructions we can ask for the residue 3 and the atom 300 but not the atom 150 with this line: :3&@300&~@150
The is also the chance of using wildcards to complete the names of atoms and residues, so we can select all the Hydrogens with @H*
. More examples follow:
:10-20&~@CA,C,N,O
represents the sidechains of the residues 10 thru 20:GLY|(@O*,N*&~:GLU)
represents all the GLY residues along with all the Oxygen and Nitrogen atoms that does not belong to GLU residue@O*&~:GLY
represents all the Oxygen atoms that does not belong to a Glycine