PCAsuite

Abstract

This program suite allows the user to compress Molecular Dynamics (MD) trajectories using Principal Component Analysis (PCA) algorithms. This technique offers a good compression ratio at the expense of losing some precision in the trajectory.

Overview

People working with Molecular Dynamics simulations know that the output files of their work are huge. They need to save this work for later study, and the available space disappears quickly. Traditional general-purpose compression algorithms such as LZW have been used to reduce the required space, but they usually do not perform well with this kind of data. However, this data is not random: it follows a pattern and has a well-defined meaning that we can exploit to reduce its size. We also know that, most of the time, we do not need to store all the details. For some kinds of analysis, higher-frequency movements can be due to temperature or other irrelevant factors.

This knowledge points us towards the use of PCA techniques. These methods change the coordinate space of the system being analyzed so that the maximum amount of variance is captured with as few coordinates as possible. They also allow us to select the degree of fidelity to the original trajectory that we need: if we need a very accurate trajectory, we can choose to compress the data less in order to keep more detail.

Principal Component Analysis is a technique used to reduce the dimensionality of a dataset. It is an orthogonal linear transformation that maps the data into a new coordinate system such that the greatest variance by any projection of the data lies on the first coordinate, the second greatest variance lies on the second coordinate, and so on. When we keep the first coordinates, we keep the movements that contribute most to the variance of the dataset. The last coordinates can then be ignored because their contribution is very low; we store less information, but the analysis can be performed in the same way.

Operation

There are different ways to apply this technique. I will explain the method used by this suite.

Let's suppose we have an MD trajectory of N atoms and F frames. The first action is to prepare the input for the actual processing and compression. We must recenter the different snapshots onto a representative structure in order to achieve good compression. This action is performed in three steps:

The recentering is performed by looking for the best RMSd fit. This value can be computed using two different algorithms: a standard RMSd algorithm and a gaussian version of it.

The gaussian algorithm may help to reduce the number of eigenvectors needed for a given compression quality, thus reducing the size of the compressed file. It also enables other analyses, such as hinge point prediction, which are much more difficult and imprecise with a standard RMSd algorithm.
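As a rough illustration of the quantity involved in the fit, the following Python sketch computes the plain RMSd between two coordinate sets that are assumed to be already superimposed. The function name `rmsd` and the data layout (one `(x, y, z)` tuple per atom) are assumptions made for this example. The fitting step itself and the gaussian variant used by the suite are not reproduced here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized coordinate sets."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must have the same number of atoms")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two positions of the same atom
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical two-atom structures: the snapshot is shifted by 1 along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
snapshot = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, snapshot))  # 1.0
```

A fitting procedure would search for the rotation and translation of the snapshot that make this value as small as possible.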

The first step of the compression itself is to compute the covariance matrix of the trajectory, where the random variables are the 3N coordinates of the N atoms. This leads to a symmetric square matrix of dimensions 3N x 3N. Once we have this matrix, we diagonalize it to get its eigenvalues and eigenvectors. The sum of all the eigenvalues is the total variance of the trajectory, and each individual eigenvalue is the amount of variance explained by the corresponding eigenvector.
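The covariance step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each frame is a flat list of 3N coordinates, the sums are normalized by the number of frames (the suite may normalize differently), and the name `covariance_matrix` is hypothetical. The trace of the matrix is printed because the trace of a symmetric matrix equals the sum of its eigenvalues, which is the total variance mentioned above.

```python
def covariance_matrix(frames):
    """Return (mean, covariance) for frames given as flat lists of 3N values."""
    n_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / n_frames for i in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for frame in frames:
        deviation = [frame[i] - mean[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                # Assumption: divide by the number of frames, not frames - 1
                cov[i][j] += deviation[i] * deviation[j] / n_frames
    return mean, cov


# Hypothetical trajectory: one atom (3 coordinates), four frames
frames = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
]
mean, cov = covariance_matrix(frames)
trace = sum(cov[i][i] for i in range(len(cov)))
print(mean)   # [1.0, 1.0, 0.0]
print(trace)  # 2.0, the total variance
```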

With this data we can select and save enough eigenvectors to preserve an arbitrary amount of variance and discard the ones that give us little to no information.
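A minimal sketch of this selection, assuming the eigenvalues are already available as a list of numbers. The function name `eigenvectors_needed` and the sample eigenvalues are hypothetical; the sketch only shows how a requested percentage of variance translates into a number of eigenvectors.

```python
def eigenvectors_needed(eigenvalues, quality):
    """Smallest number of leading eigenvectors explaining `quality` percent of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    target = sum(ordered) * quality / 100.0
    explained = 0.0
    for count, value in enumerate(ordered, start=1):
        explained += value
        if explained >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues; their sum (100.0) is the total variance
eigenvalues = [50.0, 30.0, 10.0, 6.0, 4.0]
print(eigenvectors_needed(eigenvalues, 90))  # 3
print(eigenvectors_needed(eigenvalues, 95))  # 4
```

The eigenvectors beyond the returned count are the ones that can be discarded for the requested quality.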

Once the eigenvectors have been selected, the original data must be projected onto the new coordinate space. The final output of the algorithm must contain the mean structure, the eigenvectors and the projections in order to restore the original file. Our suite also stores some other information, such as the eigenvalues and the atom names, so that analysis and manipulation can be performed in a quicker and more flexible way.
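The projection step and the resulting amount of stored data can be sketched as follows. The names `project` and `stored_values`, the sample sizes and the orthonormal example vectors are assumptions for this illustration. The count ignores headers, atom names and the actual on-disk encoding of the PCZ formats.

```python
def project(frame, mean, eigenvectors):
    """Project one frame onto the selected eigenvectors after removing the mean."""
    centered = [x - m for x, m in zip(frame, mean)]
    return [sum(c * v for c, v in zip(centered, vec)) for vec in eigenvectors]


def stored_values(n_atoms, n_frames, n_vectors):
    """Numbers kept: mean structure + eigenvectors + eigenvalues + projections."""
    dim = 3 * n_atoms
    return dim + n_vectors * dim + n_vectors + n_vectors * n_frames


mean = [1.0, 1.0, 0.0]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # assumed orthonormal
print(project([2.0, 0.0, 0.0], mean, eigenvectors))  # [1.0, -1.0]

# Hypothetical trajectory: 1000 atoms, 10000 frames, 20 eigenvectors kept
print(3 * 1000 * 10000)                 # 30000000 coordinates in the original
print(stored_values(1000, 10000, 20))   # 263020 values after compression
```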

File formats

As this code has evolved, the format of the compressed files has changed over time. This suite works natively with the PCZ4 format, but it can also read files written in the PCZ2 and PCZ3 formats. All the formats are binary. PCZ2 and PCZ3 are very similar, while PCZ4 adds new features.

The most important difference between PCZ2/3 and PCZ4 is the lack of atom names in the former. This makes it impossible to use masks to select atoms in PCZ2/3 files.

Installation

The code is quite standard, so no difficult steps are needed. First, download the tarball pcasuite.tar.gz. Once you have the tarball, uncompress it, enter the newly created pcazip folder and compile the source with make.

Prior to executing the compilation we must choose a compiler and adjust the Makefile with the proper flags. This is done through the config.mk file. This file is a soft link to a file with the proper flags. Some files for different compilers are provided:

There is a default link to compile with GCC v4. In summary, the steps needed to compile the application are:

  1. $ wget http://mmb.pcb.ub.es/software/pcasuite.tar.gz
  2. $ tar xf pcasuite.tar.gz
  3. $ cd pcasuite
  4. $ rm config.mk
  5. $ ln -s config.gcc4 config.mk
  6. $ make

This procedure compiles the source code and leaves the binaries in the same folder. These binaries can be moved to a more convenient location, for example a directory in your PATH, so they can be executed easily.

Some adjustments may be required to compile the code if the required libraries are not stored in standard paths. A local config.mk file can be created to adapt the build to the computer where the suite is being compiled.

Utilities

PCAzip

This tool is the main compression engine. It performs the steps outlined in the Operation section: it reads the trajectory, recenters the frames, computes the covariance matrix, computes the eigenvectors and the projections of the trajectory onto the new coordinate space, and finally writes the compressed file.

Its operation is quite simple: it only needs the input trajectory, a PDB file to get the atom names from, and some CPU time. The syntax is as follows:
$ pcazip -i <input_trajectory> -p <input_PDB_file> -o <output_file> [options]

Complete list of supported options:

-i <input_file>
Specifies the name of the file containing the trajectory in an Amber-like format
-o <output_file>
Specifies the name of the file that will store the compressed trajectory
-p <pdb_file>
Specifies the name of a PDB-like file containing the names for the atoms in the input trajectory
-n <number_of_atoms>
Specifies how many atoms the trajectory has. It can be used when no PDB file can be specified, but the features that depend on atom and residue names will be disabled
-m <mask_file>
Specifies the name of a PDB-like file containing the atoms that should be taken into account when compressing the trajectory. Only these atoms will be used in the compression process and only these atoms will be in the output file
-M <mask_string>
Specifies a mask string for the selection of the atoms. It serves the same purpose as the -m switch, but it works in a descriptive way. The mask has the same form as the Amber/ptraj masks
-e <number_of_eigenvectors>
Specifies the number of eigenvectors that must be stored in the file
-q <quality>
Specifies the quality of the compression as a percentage value. It represents the percentage of the total variance that must be explained by the stored eigenvectors
-g
If specified, the protein recentering is performed with a gaussian version of the RMSd algorithm. If not specified, a standard RMSd algorithm is used
-v
Makes the program more verbose, giving more information about its progress
-h
Displays a short help for the user

Examples of use:

Compress a trajectory, including atom name information, with 90% quality
$ pcazip -i traj.x -p traj.pdb -o traj.pcz
Compress a trajectory, without atom name information, with 95% quality
$ pcazip -i traj.x -n numberOfAtoms -o traj.pcz -q 95
Compress a trajectory, including atom name information, using the gaussian RMSd and taking the first 20 eigenvectors
$ pcazip -i traj.x -p traj.pdb -o traj.pcz -e 20 -g
Compress the backbone of a trajectory, including atom name information, and asking for a verbose output
$ pcazip -i traj.x -p traj.pdb -o traj.pcz -v -M @C,CA,N,O

PCAunzip

This tool reconstructs the trajectory from the compressed data. It works by retrieving the eigenvectors and the associated projections and combining them until the original data is rebuilt. This tool is simple and easy to use, and its parameters mimic those of the other tools in the suite. This is the complete list of supported options:

-i <input_file>
Specifies the name of the file containing the compressed trajectory
-o <output_file>
Specifies the name of the file that will store the uncompressed trajectory in an Amber-like format
--pdb
If this flag is present, the output format is PDB. This requires that a PDB file was supplied when the trajectory was compressed
-v
Makes the program more verbose, giving more information about its progress
-h
Displays a short help for the user

Examples of use:

Uncompress a trajectory to an Amber-like file
$ pcaunzip -i traj.pcz -o traj.x
Uncompress a trajectory to a PDB-like file
$ pcaunzip -i traj.pcz -o traj.x --pdb
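
The reconstruction performed by PCAunzip, as described at the start of this section, can be sketched in Python. This is an illustration under assumptions rather than the tool's actual code: the function name `reconstruct`, the flat-list layout and the sample values are hypothetical. Each frame is rebuilt as the mean structure plus the sum of each stored eigenvector scaled by its projection.

```python
def reconstruct(projections, mean, eigenvectors):
    """Rebuild one frame: mean + sum over k of projection_k * eigenvector_k."""
    frame = list(mean)
    for projection, vector in zip(projections, eigenvectors):
        for i, component in enumerate(vector):
            frame[i] += projection * component
    return frame


mean = [1.0, 1.0, 0.0]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # only two vectors were kept
original = [2.0, 0.0, 0.5]
projections = [1.0, -1.0]  # projections of `original` onto the kept vectors

rebuilt = reconstruct(projections, mean, eigenvectors)
print(rebuilt)  # [2.0, 0.0, 0.0]
```

The z displacement of 0.5 in the hypothetical original frame is not recovered because the direction that carried it was discarded. This is the loss of precision traded for compression.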

PCZdump

This is the tool used when we need to analyze and query the data stored inside the compressed file. It allows the user to query for the values stored directly in the file and also to compute other values based on the stored ones.

The information that can be retrieved with this tool is:

The supported options are:

-i <input_file>
Specifies the name of the file containing the compressed trajectory
-o <output_file>
Specifies the name of the file that will store the output of the query
--info
Returns basic information about the file. It gives the title of the trajectory, the number of atoms, eigenvectors and frames, the total and explained variance, the quality of the compression as a percentage, the dimensionality, the RMSd type used in the compression and whether or not the file contains atom names
--avg
Returns the average structure stored in the compressed file
--evals
Returns the eigenvalues for the stored eigenvectors
--evec <eigenvector>
Returns the requested eigenvector from the list of eigenvectors stored in the compressed file
--proj <eigenvector>
Returns all the projections for the requested eigenvector
--rms <frame>
Computes the RMSd between the given frame and all the others
--fluc <eigenvector>
Computes the atomic fluctuations along the trajectory for the requested eigenvector, or for the whole trajectory if none is given
--bfactor
If this flag is present, the values of atomic fluctuation are given as B-factor values
--anim <eigenvector>
Animates the system along the requested eigenvector
--lindemann
Computes the liquidity/solidity Lindemann coefficient
--collectivity <eigenvector>
Computes a collectivity index of movement for the requested eigenvector
--forcecte <temperature>
Computes the force constants given a simulation temperature
--hinge
Computes hinge point predictions
--mask <mask_string>
Specifies a mask string for the selection of the atoms. The mask has the same form as the Amber/ptraj masks
--pdb
If this flag is present, the output format is PDB, where applicable. This requires that a PDB file was supplied when the trajectory was compressed
--verbose
Makes the program more verbose, giving more information about its progress
--help
Displays a short help for the user

Examples of use:

Obtain basic information about a file
$ pczdump -i traj.pcz --info
Obtain the eigenvalues and store the output in a file
$ pczdump -i traj.pcz --evals -o evals.dat
Obtain the first eigenvector and give the output in PDB format
$ pczdump -i traj.pcz --evec 1 --pdb
Obtain the Lindemann coefficient for the sidechains
$ pczdump -i traj.pcz --lindemann --mask "~@C,CA,N,O"
Obtain the force constants for this simulation at 300 K
$ pczdump -i traj.pcz --forcecte 300
Obtain the B-factors for the second eigenvector
$ pczdump -i traj.pcz --fluc=2 --bfactor

Mask Strings

Mask strings have been implemented to let users specify, easily and concisely, the atoms that must be used in a compression or analysis, without having to generate more files. Their syntax is very similar to that used by the Amber utilities, but slightly different to allow for greater flexibility.

The masks are composed of atom and residue specifications and the connectors between them. Atoms are preceded by @ and residues by :. After one of those symbols comes the atom/residue specification. It can come in numerical form, giving the atom or residue number, or in alphabetical form, giving the atom or residue name. Multiple specifications can be separated by commas, and when using numeric specifications, ranges can be created by using dashes. Some examples:

All of these specifications can be combined by means of logical constructions and parentheses. Logical AND (&), logical OR (|) and logical NOT (~) can be used and combined. With these constructions we can ask for the residue 3 and the atom 300 but not the atom 150 with this line: :3&@300&~@150

There is also the option of using wildcards to complete the names of atoms and residues, so we can select all the hydrogens with @H*. More examples follow:
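
To make the selection rules in this section more concrete, here is a small Python sketch. It is not the suite's parser: it evaluates only a single specification (for example `@N,CA`, `:1-3` or `@H*`) and uses Python set operators to stand in for the logical connectors. The function name `select`, the dictionary keys and the six-atom system are all hypothetical.

```python
from fnmatch import fnmatchcase


def select(spec, atoms):
    """Indices of the atoms matching one specification such as '@N,CA', ':1-3' or '@H*'."""
    kind, items = spec[0], spec[1:].split(",")
    if kind == "@":
        number_key, name_key = "index", "name"
    elif kind == ":":
        number_key, name_key = "resid", "resname"
    else:
        raise ValueError("a specification must start with '@' or ':'")
    chosen = set()
    for atom in atoms:
        for item in items:
            first, dash, last = item.partition("-")
            if first.isdigit() and dash and last.isdigit():
                hit = int(first) <= atom[number_key] <= int(last)  # numeric range
            elif item.isdigit():
                hit = atom[number_key] == int(item)                # single number
            else:
                hit = fnmatchcase(atom[name_key], item)            # name, '*' allowed
            if hit:
                chosen.add(atom["index"])
    return chosen


atoms = [
    {"index": 1, "name": "N", "resid": 1, "resname": "ALA"},
    {"index": 2, "name": "H", "resid": 1, "resname": "ALA"},
    {"index": 3, "name": "CA", "resid": 1, "resname": "ALA"},
    {"index": 4, "name": "HA", "resid": 1, "resname": "ALA"},
    {"index": 5, "name": "N", "resid": 2, "resname": "GLY"},
    {"index": 6, "name": "CA", "resid": 2, "resname": "GLY"},
]
everything = {atom["index"] for atom in atoms}
hydrogens = select("@H*", atoms)

# Set operators standing in for AND (&), OR (|) and NOT (~)
print(sorted(select(":1", atoms) & (everything - hydrogens)))   # like :1&~@H*  -> [1, 3]
print(sorted(select(":2", atoms) | select("@1-2", atoms)))      # like :2|@1-2 -> [1, 2, 5, 6]
```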

Parallelism in PCAzip

PCAzip is the most time-consuming application of the suite. The most time-consuming routines of the algorithm are the covariance matrix generation and the eigenvector calculation. These routines have been targeted for parallelization with MPI, the standard paradigm for distributed-memory parallelization of code.

The parallelization of the eigenvector computation has been achieved through the use of the ScaLAPACK library, a highly optimized and portable library written in Fortran. Most hardware vendors offer their own optimized version for their machines. ScaLAPACK is itself parallelized with MPI for distributed-memory machines, which conditioned the decision to use MPI for the parallelization of the covariance matrix calculation.

The parallelization of the covariance matrix computation has been done with MPI because the ScaLAPACK library uses this methodology, but also because it is a more scalable technology than those for shared-memory machines, such as OpenMP. On shared-memory machines we are limited to the processors physically present in the machine, and no other networked processors can be used. With a distributed-memory approach this limitation is overcome and remote processors can be used, for example those of a computer cluster. The parallelization effort is greater in the message-passing paradigm, but its scalability compensates for it.
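
The text does not describe how the covariance work is divided among processes. One common decomposition, assumed here purely for illustration, gives each process a subset of the frames, lets it accumulate partial sums, and then adds those sums together in a reduction. The sketch below imitates this without MPI: the two chunks stand in for two processes, and the names `partial_sums` and `combine` are hypothetical.

```python
def partial_sums(frames):
    """Per-worker contribution: frame count, sum of coordinates, sum of outer products."""
    dim = len(frames[0])
    total = [0.0] * dim
    products = [[0.0] * dim for _ in range(dim)]
    for frame in frames:
        for i in range(dim):
            total[i] += frame[i]
            for j in range(dim):
                products[i][j] += frame[i] * frame[j]
    return len(frames), total, products


def combine(parts):
    """Reduction step: add the partial sums, then form E[x_i x_j] - E[x_i] E[x_j]."""
    dim = len(parts[0][1])
    n = sum(part[0] for part in parts)
    total = [sum(part[1][i] for part in parts) for i in range(dim)]
    products = [[sum(part[2][i][j] for part in parts) for j in range(dim)]
                for i in range(dim)]
    # Note: this one-pass formula is simple to distribute but can lose accuracy
    # when the coordinates are large compared with their fluctuations.
    return [[products[i][j] / n - (total[i] / n) * (total[j] / n)
             for j in range(dim)] for i in range(dim)]


frames = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
serial = combine([partial_sums(frames)])
split = combine([partial_sums(frames[:1]), partial_sums(frames[1:])])
print(serial == split)  # True: the split computation gives the same matrix
```

Because each worker only touches its own frames and the reduction exchanges a fixed amount of data, this kind of decomposition tends to scale well with the number of frames.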

After benchmarking the resulting code, we can see that the covariance matrix calculation scales really well, while the eigenvector calculation scales more poorly. This means that we must choose the number of processors carefully in order not to waste resources. If the covariance matrix calculation is expected to be heavier than the diagonalization, more processors can be added without fear. But if the diagonalization is the most time-consuming step, the number of processors must be adjusted so that this step runs efficiently.

Information for developers

The source code of the PCAzip utility has been prepared to support multiple input file formats. This has been achieved through a file recognition process. Adding a new file format to the compressor is easy: it only requires modifying one file and creating another.

The core of the file recognition engine is the set of methods in the traj_io.c file. When a file is opened with the trajopen method, the different input modules are tried in order until one of them accepts this kind of file.

The methods used to identify the files are stored in function pointer arrays:

Each array contains pointers to the methods used to open, read, identify or close a format. This means that, to support a new format, you must provide a new C source file with four pieces of code: one to open a file in that format, one to read a snapshot from a previously opened file, one to identify whether a file is in that format, and one to close the file.

Examples of these methods can be found in the binpos_io.c file, which contains the code needed to read binpos binary files.
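
The recognition mechanism described above can be mimicked in Python to show the idea. The real implementation is C code with function pointer arrays in traj_io.c, which is not reproduced here. Everything in this sketch is hypothetical: the two made-up formats, their magic bytes, the table names and the `open_trajectory` function.

```python
import io


def identify_a(stream):
    """Identify step: check the magic bytes, then rewind the stream."""
    magic = stream.read(4)
    stream.seek(0)
    return magic == b"AAAA"


def identify_b(stream):
    magic = stream.read(4)
    stream.seek(0)
    return magic == b"BBBB"


def read_a(stream):
    """Read step for made-up format A: one byte per value after the magic."""
    stream.seek(4)
    return list(stream.read())


def read_b(stream):
    """Read step for made-up format B: same layout, values stored in reverse."""
    stream.seek(4)
    return list(stream.read())[::-1]


# Parallel tables, one entry per format, in the order the modules are tried.
# Supporting a new format means writing its functions and adding one entry here.
FORMAT_NAMES = ["format_a", "format_b"]
IDENTIFY = [identify_a, identify_b]
READ = [read_a, read_b]


def open_trajectory(stream):
    """Try each module in order and return the index of the first that accepts."""
    for module, identify in enumerate(IDENTIFY):
        if identify(stream):
            return module
    raise ValueError("no input module recognizes this file")


stream = io.BytesIO(b"BBBB\x01\x02\x03")
module = open_trajectory(stream)
print(FORMAT_NAMES[module], READ[module](stream))  # format_b [3, 2, 1]
```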