PCAsuite

A tool to compress Molecular Dynamics trajectories using Principal Component Analysis

Abstract
Overview
Description
File formats
Download and installation
Execution
Mask Strings

1. Abstract

This program suite allows the user to compress Molecular Dynamics (MD) trajectories using Principal Component Analysis (PCA) algorithms. This technique offers a good compression ratio at the expense of losing some precision in the trajectory.

2. Overview

One of the main obstacles to the widespread use of Molecular Dynamics is the potentially large size of its trajectory files. State-of-the-art simulations in the high nanosecond time scale can easily span several GB, especially for large systems. Traditional general-purpose compression algorithms like LZW have been used to reduce the required space, but they usually do not perform well with this kind of data. However, trajectory data is not random. It follows patterns with a well-defined meaning that can be exploited for data compression. In particular, higher frequency movements can be discarded without affecting the overall dynamics of the system.

Principal Component Analysis is one of the most widely used techniques to capture the essential movements of a macromolecular system. It implies a change of coordinate space in which the reference eigenvectors are chosen according to the amount of system variance they explain. The aim is to select the minimum number of reference coordinates that explain a given amount of system variance. The technique allows the user to select the degree of fidelity to the original trajectory. If all the eigenvectors are chosen, there is no loss of accuracy in the trajectory. Removing the eigenvectors with the lowest amount of explained variance, however, has little effect on the overall behavior of the system but a remarkable effect on the size of the data.

3. Description

Let's suppose we have an MD trajectory of N atoms and F frames. The first action is to prepare the input for the actual processing and compression. The trajectory must be superimposed onto a representative structure. This minimizes the oscillations around the average structure and, hence, the number of required eigenvectors. This action is performed in three steps:

Superimpose all the snapshots of the trajectory onto the first one
Compute the average snapshot after the first step
Superimpose all the snapshots onto the computed average
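
The three steps above can be sketched as follows. This is only an illustration in Python, not PCAsuite's implementation: to stay free of dependencies it works on 2D coordinates, where the optimal rotation has a closed form, whereas the real fit is three-dimensional. The function names (superimpose, fit_trajectory) are hypothetical.

```python
import math

def centroid(frame):
    """Geometric center of a list of (x, y) points."""
    n = len(frame)
    return (sum(p[0] for p in frame) / n, sum(p[1] for p in frame) / n)

def superimpose(frame, target):
    """Least-squares fit of a 2D frame onto target (translation + rotation)."""
    cf, ct = centroid(frame), centroid(target)
    a = [(x - cf[0], y - cf[1]) for x, y in frame]
    b = [(x - ct[0], y - ct[1]) for x, y in target]
    # The best angle maximizes sum(dot(R a_i, b_i)) = cos*s_dot + sin*s_cross.
    s_dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    s_cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * ax - s * ay + ct[0], s * ax + c * ay + ct[1]) for ax, ay in a]

def average(frames):
    """Average snapshot: mean position of every atom over all frames."""
    n = len(frames)
    return [(sum(f[i][0] for f in frames) / n, sum(f[i][1] for f in frames) / n)
            for i in range(len(frames[0]))]

def rmsd(a, b):
    """Root mean square deviation between two frames with matching atoms."""
    total = sum((ax - bx) ** 2 + (ay - by) ** 2
                for (ax, ay), (bx, by) in zip(a, b))
    return math.sqrt(total / len(a))

def fit_trajectory(frames):
    """Three-step fit: onto the first frame, average, then onto the average."""
    on_first = [superimpose(f, frames[0]) for f in frames]   # step 1
    mean_frame = average(on_first)                           # step 2
    return [superimpose(f, mean_frame) for f in on_first]    # step 3
```

Fitting onto the average rather than onto the first snapshot avoids biasing the result towards an arbitrary frame.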

The superimposition is performed by looking for the best RMSd fit. This value can be computed using two different algorithms:

A standard, Kabsch-based, RMSd algorithm
A Gaussian RMSd algorithm that gives more weight to the still atoms and less weight to the moving atoms

The Gaussian RMSd algorithm may help to reduce the number of eigenvectors needed for a given compression, thus reducing the size of the compressed file. It also enables other analyses, like hinge point prediction, that are much more difficult and imprecise with a standard RMSd algorithm.

Once the trajectory has been superimposed, the first step of the compression itself is to compute the covariance matrix of the trajectory, where the random variables are the coordinates of the N atoms. This leads to a symmetric square matrix of 3N*3N dimensions. The matrix is then diagonalized to get the associated eigenvalues and eigenvectors. The sum of all the eigenvalues is the total variance of the trajectory, and each individual eigenvalue is the amount of variance explained by the corresponding eigenvector. Using this data, the number of eigenvectors (NV) that explain the desired percentage of the total variance can be selected. A higher number of eigenvectors implies a more accurate representation of the original trajectory, but leads to a lower compression rate.
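
As an illustration of the selection step, the sketch below picks the smallest NV whose leading eigenvalues reach a requested percentage of the total variance. It is a hedged example in Python, not the code used by PCAsuite; the function name and the sample eigenvalues are invented for the illustration.

```python
def eigenvectors_for_quality(eigenvalues, quality):
    """Smallest NV whose largest eigenvalues explain `quality` percent
    of the total variance (the sum of all the eigenvalues)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("the total variance must be positive")
    explained = 0.0
    for nv, value in enumerate(ordered, start=1):
        explained += value
        if 100.0 * explained / total >= quality:
            return nv
    return len(ordered)

# Hypothetical eigenvalues whose total variance is 100
eigenvalues = [50.0, 30.0, 10.0, 5.0, 3.0, 2.0]
print(eigenvectors_for_quality(eigenvalues, 90.0))  # 3
print(eigenvectors_for_quality(eigenvalues, 95.0))  # 4
```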

Once the eigenvectors have been selected, the coordinates of the original F frames are projected onto the new coordinate space. The final output of the algorithm contains the average structure, the eigenvectors and the calculated projections. The stored trajectory data is thus reduced from F*3N coordinates to F*NV projections, plus the NV eigenvectors of 3N components each and the 3N coordinates of the average structure. PCAsuite also stores other information, like the eigenvalues and the atom names, so that analyses and manipulations can be performed quickly and flexibly.
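
The projection and its inverse can be sketched as below. This is a minimal Python illustration under the assumption that the eigenvectors are orthonormal and that each frame is a flat list of 3N coordinates; the function names and the toy data are hypothetical and do not come from PCAsuite.

```python
def project(frame, average, eigenvectors):
    """Project one frame onto the selected eigenvectors.
    Returns one number per eigenvector."""
    centered = [x - a for x, a in zip(frame, average)]
    return [sum(c * v for c, v in zip(centered, vec)) for vec in eigenvectors]

def reconstruct(projections, average, eigenvectors):
    """Rebuild approximate coordinates: average + sum_k p_k * v_k."""
    frame = list(average)
    for p, vec in zip(projections, eigenvectors):
        for i, v in enumerate(vec):
            frame[i] += p * v
    return frame

# Toy system with 3 coordinates, keeping 2 of the 3 eigenvectors
average = [1.0, 2.0, 3.0]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frame = [2.0, 4.0, 3.5]

projections = project(frame, average, eigenvectors)
print(projections)                                      # [1.0, 2.0]
print(reconstruct(projections, average, eigenvectors))  # [2.0, 4.0, 3.0]
```

The displacement along the discarded third eigenvector (0.5 in this toy case) is the information lost by the compression.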

4. File formats

The format of the compressed files has changed as the code has evolved. PCAsuite works natively with the PCZ4 format, but it also supports files stored in the PCZ2 and PCZ3 formats. All the formats are binary. PCZ2 and PCZ3 are very similar, while PCZ4 adds new features.

The most important improvement in PCZ4 is the inclusion of the atom names, which makes it possible to use masks to perform actions on a selected set of atoms.

5. Download and installation

PCAsuite is portable and can be compiled for any usual architecture. We also provide statically precompiled binaries for the most popular systems and architectures.

5.a. Precompiled binaries

Precompiled binaries include all the necessary libraries in the executable file, and have been prepared and tested on the most common computer architectures. To find out your machine's architecture on Linux, open a terminal and type 'uname -m'.

These binaries have been statically compiled against gFORTRAN, NetCDF and LAPACK. Redistribution of the binaries is subject to the availability of the source code and the reproduction of their LICENSE files.

5.b. From source code

System requirements:

A C compiler and its standard libraries. Most systems already provide a compiling environment, for example, gcc
A FORTRAN compiler (gFORTRAN, for instance) and its libraries.
The NetCDF libraries
The LAPACK libraries. The latest versions also include BLAS; make sure to run 'make blaslib' to compile it

First, download the pcasuite.tar.gz tarball and extract it.

Before compiling, choose a compiler and adjust the build flags. This is done through the config.mk file, which is a soft link to a file containing the proper flags. Some files adapted to different compilers are provided:

config.gcc3 for the GCC v3 compiler series
config.gcc4 for the GCC v4 compiler series
config.intel for the Intel compilers
config.xlc for the IBM xl compilers

By default, config.mk is linked to the GCC v4 file (config.gcc4). Some adjustments may be needed to compile the code if the required libraries are not stored in standard paths.

In summary, the steps needed to compile the application are:

$ wget http://mmb.pcb.ub.es/software/pcasuite.tar.gz
$ tar xf pcasuite.tar.gz
$ cd pcasuite
$ rm config.mk
$ ln -s config.gcc4 config.mk
Edit config.mk as needed. For example, you might want to edit the CFLAGS to add new include or library directories (flags -I and -L)
$ make

This procedure compiles the source code and leaves the binaries in the same folder. These binaries can be moved to a suitable location so that they can be executed easily.

6. Execution

6.1. PCAzip

This tool is the main compression engine. It performs the steps outlined in the Description section: it reads the trajectory, recenters the frames, computes the covariance matrix, the eigenvectors and the projections of the trajectory onto the new coordinate space, and finally writes the compressed file.

The input files are the trajectory to be compressed and a PDB file matching the simulated structure. The required syntax is:

$ pcazip -i <input_trajectory> -p <input_PDB_file> -o <output_file> [options]

Complete list of supported options:

-i <input_file>: Specifies the name of the file containing the trajectory in an Amber-like format
-o <output_file>: Specifies the name of the file that will store the compressed trajectory
-p <pdb_file>: Specifies the name of a PDB file matching the simulated system. Only the atom names are relevant; the coordinates in this PDB file are discarded
-n <number_of_atoms>: Specifies how many atoms the trajectory has. It should be used when no PDB file is specified. Functionality that depends on the atom and residue names will be disabled
-m <mask_file>: Specifies the name of a PDB file containing the atoms that should be taken into account when compressing the trajectory. Only these atoms will be used in the compression process and only these atoms will be in the output file
-M <mask_string>: Specifies a mask string for the selection of the atoms. It serves the same purpose as the -m switch, but it works in a descriptive way. The mask has a syntax similar to that of Amber/ptraj masks
-e <number_of_eigenvectors>: Specifies the number of eigenvectors that must be stored in the file
-q <quality>: Specifies the quality of the compression as a percentage of explained variance (default 90%)
-g: If specified, the protein recentering is performed with a Gaussian version of the RMSd algorithm. If not specified, a standard RMSd algorithm is used
-v: Makes the program more verbose, giving more information about its progress
-h: Displays a short help for the user

Examples of use:

Compress a trajectory, including atom name information, with 90% quality: $ pcazip -i traj.x -p traj.pdb -o traj.pcz
Compress a trajectory, without atom name information, with 95% quality: $ pcazip -i traj.x -n numberOfAtoms -o traj.pcz -q 95
Compress a trajectory, including atom name information, using the Gaussian RMSd and taking the first 20 eigenvectors: $ pcazip -i traj.x -p traj.pdb -o traj.pcz -e 20 -g
Compress the backbone of a trajectory, including atom name information, and ask for verbose output: $ pcazip -i traj.x -p traj.pdb -o traj.pcz -v -M @C,CA,N,O

6.2. PCAunzip

This tool reconstructs the trajectory from the compressed data. It retrieves the eigenvectors and the associated projections and combines them to recover the trajectory, up to the precision retained at compression time. This is the complete list of supported options:

-i <input_file>: Specifies the name of the file containing the compressed trajectory
-o <output_file>: Specifies the name of the file that will store the uncompressed trajectory in an Amber-like format
--pdb: If this flag is present, the output format is PDB. This requires that a PDB file was given when the trajectory was compressed
-v: Makes the program more verbose, giving more information about its progress
-h: Displays a short help for the user

Examples of use:

Uncompress a trajectory to an Amber-like file: $ pcaunzip -i traj.pcz -o traj.x
Uncompress a trajectory to a PDB-like file: $ pcaunzip -i traj.pcz -o traj.x --pdb

6.3. PCZdump

This is the tool used to analyze and query the data stored inside the compressed file. It can return the values stored directly in the file and also compute other values based on the stored ones.

The information that can be retrieved with this tool is:

Title of the trajectory
Number of atoms, eigenvectors and frames
Total and explained variance
Percentage quality of the compression
Dimensionality
RMSd type used in the compression
Whether or not the file contains atom names
Average structure
Eigenvalues
Eigenvectors
Projections
RMSd between frames
Atomic fluctuations and B-factors
Animations along an eigenvector
Lindemann coefficient
Collectivity indexes of movement
Force constants
Hinge point predictions

The supported options are:

-i <input_file>: Specifies the name of the file containing the compressed trajectory
-o <output_file>: Specifies the name of the file that will store the output of the query
--info: Returns basic information about the file. It gives the title of the trajectory, the number of atoms, eigenvectors and frames, the total and explained variance, the percentage quality of the compression, the dimensionality, the RMSd type used in the compression and whether or not the file contains atom names
--avg: Returns the average structure stored in the compressed file
--evals: Returns the eigenvalues for the stored eigenvectors
--evec <eigenvector>: Returns the requested eigenvector from the list of eigenvectors stored in the compressed file
--proj <eigenvector>: Returns all the projections for the requested eigenvector
--rms <frame>: Computes the RMSd between the given frame and all the others
--fluc <eigenvector>: Computes the atomic fluctuations along the trajectory for the requested eigenvector, or for the whole trajectory if none is given
--bfactor: If this flag is present, the values of atomic fluctuation are given as B-factor values
--anim <eigenvector>: Animates the system along the requested eigenvector
--lindemann: Computes the liquidity/solidity Lindemann coefficient
--collectivity <eigenvector>: Computes a collectivity index of movement for the requested eigenvector
--forcecte <temperature>: Computes the force constants given a simulation temperature
--hinge: Computes hinge point predictions
--mask <mask_string>: Specifies a mask string for the selection of the atoms. The mask has a similar syntax to Amber/ptraj masks
--pdb: If this flag is present, the output format is PDB, if suitable. This requires that a PDB file was given when the trajectory was compressed
--verbose: Makes the program more verbose, giving more information about its progress
--help: Displays a short help for the user

Examples of use:

Obtain basic information about a file: $ pczdump -i traj.pcz --info
Obtain the eigenvalues and store the output in a file: $ pczdump -i traj.pcz --evals -o evals.dat
Obtain the first eigenvector and give the output in PDB format: $ pczdump -i traj.pcz --evec 1 --pdb
Obtain the Lindemann coefficient for the sidechains: $ pczdump -i traj.pcz --lindemann --mask "~@C,CA,N,O"
Obtain the force constants for this simulation at 300 K: $ pczdump -i traj.pcz --forcecte 300
Obtain the B-factors for the second eigenvector: $ pczdump -i traj.pcz --fluc=2 --bfactor
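
Some of the derived quantities listed above can be obtained from the stored eigenvalues and eigenvectors with textbook formulas. The Python sketch below is an assumption about how such values are commonly computed, not a description of pczdump's internals: it assumes non-mass-weighted eigenvectors stored as flat lists of 3N components, eigenvalues in Å², the B-factor relation B = (8π²/3)<Δr²>, the quasi-harmonic force constant k = kB*T/λ and an unweighted collectivity index. All the function names are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def atomic_fluctuations(eigenvalues, eigenvectors):
    """Mean square fluctuation per atom: sum_k lambda_k * |v_k(atom)|^2."""
    n_atoms = len(eigenvectors[0]) // 3
    msf = [0.0] * n_atoms
    for lam, vec in zip(eigenvalues, eigenvectors):
        for i in range(n_atoms):
            x, y, z = vec[3 * i:3 * i + 3]
            msf[i] += lam * (x * x + y * y + z * z)
    return msf

def b_factors(msf):
    """Convert mean square fluctuations (A^2) into B-factors (A^2)."""
    return [8.0 * math.pi ** 2 / 3.0 * m for m in msf]

def force_constants(eigenvalues, temperature):
    """Quasi-harmonic force constant of each mode, in kcal/(mol*A^2)."""
    return [K_B * temperature / lam for lam in eigenvalues]

def collectivity(eigenvector):
    """Unweighted collectivity index: 1/N for a single moving atom,
    1 when every atom moves by the same amount."""
    n_atoms = len(eigenvector) // 3
    sq = [sum(c * c for c in eigenvector[3 * i:3 * i + 3])
          for i in range(n_atoms)]
    total = sum(sq)
    # Terms with zero displacement contribute nothing (0 * log 0 = 0).
    entropy = -sum((s / total) * math.log(s / total) for s in sq if s > 0.0)
    return math.exp(entropy) / n_atoms
```

Softer modes (larger eigenvalues) yield smaller force constants, which is why the temperature of the simulation is needed to put them on an energy scale.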

7. Mask Strings

Mask strings allow users to specify the atoms to be used in a compression or analysis easily and concisely, without having to generate additional files. Their syntax is very similar to that used by the Amber utilities, but differs slightly to allow for greater flexibility.

The masks are composed of atom and residue specifications and the connectors between them. Atoms are preceded by @ and residues by :. After one of those symbols comes the atom/residue specification. It can come in numerical form, giving the atom or residue number, or in alphabetical form, giving the atom or residue name. Multiple specifications can be separated by commas, and when using numeric specifications, ranges can be created by using dashes. Some examples:

@C represents all the Carbon atoms
:GLY represents all the glycine residues
:1-10 represents the first 10 residues
@C,CA,N,O represents all the atoms belonging to the backbone
@2-5,7,9,15-30 represents all the atoms numbered from 2 to 5, 7, 9 and from 15 to 30

All of these specifications can be combined by means of logical constructions and parentheses. Logical AND (&), logical OR (|) and logical NOT (~) can be used and combined. With these constructions we can ask for the residue 3 and the atom 300 but not the atom 150 with this line: :3&@300&~@150

It is also possible to use wildcards to complete the names of atoms and residues, so we can select all the Hydrogens with @H*. More examples follow:

:10-20&~@CA,C,N,O represents the sidechains of the residues 10 through 20
:GLY|(@O*,N*&~:GLU) represents all the GLY residues along with all the Oxygen and Nitrogen atoms that do not belong to a GLU residue
@O*&~:GLY represents all the Oxygen atoms that do not belong to a Glycine
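
To make the semantics of the examples above concrete, here is a small mask evaluator written in Python. It is an illustrative sketch, not PCAsuite's parser, and it relies on several assumptions that the text does not confirm: NOT binds tighter than AND, which binds tighter than OR; names are matched exactly unless a wildcard is used; and atoms are described by hypothetical (atom_number, atom_name, residue_number, residue_name) tuples.

```python
import fnmatch
import re

def tokenize(mask):
    """Split a mask into operators and ('@' or ':', specification) pairs."""
    tokens, i = [], 0
    while i < len(mask):
        c = mask[i]
        if c.isspace():
            i += 1
        elif c in "&|~()":
            tokens.append(c)
            i += 1
        elif c in "@:":
            j = i + 1
            while j < len(mask) and mask[j] not in "&|~()@: ":
                j += 1
            tokens.append((c, mask[i + 1:j]))
            i = j
        else:
            raise ValueError("unexpected character %r in mask" % c)
    return tokens

def matches(spec, number, name):
    """True if the number or the name satisfies a comma-separated spec."""
    for item in spec.split(","):
        numeric = re.fullmatch(r"(\d+)(?:-(\d+))?", item)
        if numeric:  # a single number or a low-high range
            low = int(numeric.group(1))
            high = int(numeric.group(2) or numeric.group(1))
            if low <= number <= high:
                return True
        elif fnmatch.fnmatchcase(name, item):  # a name, possibly with '*'
            return True
    return False

def select(mask, atoms):
    """Return the sorted indices of the atoms selected by the mask."""
    tokens = tokenize(mask)
    everything = set(range(len(atoms)))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("mask ends unexpectedly")
        pos += 1
        return tokens[pos - 1]

    def factor():  # '~' factor, '(' expression ')' or a selection
        token = take()
        if token == "~":
            return everything - factor()
        if token == "(":
            inner = either()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return inner
        if isinstance(token, tuple):
            kind, spec = token
            k = 0 if kind == "@" else 2  # atom fields or residue fields
            return {i for i, atom in enumerate(atoms)
                    if matches(spec, atom[k], atom[k + 1])}
        raise ValueError("unexpected token %r" % (token,))

    def both():  # factors joined by '&'
        result = factor()
        while peek() == "&":
            take()
            result &= factor()
        return result

    def either():  # '&' groups joined by '|'
        result = both()
        while peek() == "|":
            take()
            result |= both()
        return result

    selected = either()
    if peek() is not None:
        raise ValueError("unexpected trailing token %r" % (peek(),))
    return sorted(selected)
```

For example, given a list of atoms in the tuple form described above, select("@O*&~:GLY", atoms) returns the positions in that list of the atoms whose name starts with O and whose residue is not GLY.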
