How to calculate center of mass of proteins using MDAnalysis?

Question

I'm in a bit unusual situation. There are seven different proteins stored in a single file according to their residues names. Each protein has different sequence length. Now I need to calculate the center of mass of each protein and generate a time series data.I know how to do with a single protein, but do not with multiple protein system. For single protein I can do something like this:

import MDAnalysis as mda
import numpy as np

u = mda.Universe('lp400start.gro')
u1 = mda.Merge(u.select_atoms("not resname W and not resname WF and not resname ION"))
u1.load_new('lp400.xtc')
protein = u1.select_atoms("protein")
arr = np.empty((protein.n_residues, u1.trajectory.n_frames, 3))

for ts in u.trajectory:
    arr[:, ts.frame] = protein.center_of_mass(compound='residues')

the data files are publicly available here. The residue sequence in the topology file can be checked using grep "^ *1[A-Z]" -B1 lp400final.gro | grep -v "^ *1[A-Z]", the output:

 38ALA     BB   74  52.901  33.911   6.318
--
   38ALA     BB  148  41.842  29.381   7.211
--
  137GLY     BB  455  36.756   4.287   3.284
--
  137GLY     BB  762  44.721  60.377   3.112
--
  252HIS    SC3 1368  28.682  37.936   6.727
--
  252HIS    SC3 1974  18.533  46.506   6.314
--
  576PHE    SC3 3263  48.937  38.538   4.013
--
  576PHE    SC3 4552  18.513  25.948   3.800
--
 1092PRO    SC1 6470  42.510  40.992   6.775
--
 1092PRO    SC1 8388  14.709   4.759   6.370
--
 1016LEU    SC110524  57.264  56.308   2.632
--
 1016LEU    SC112660  50.716  14.698   2.728
--
 1285LYS    SC215345   0.793  33.529   1.509

First protein has sequence length of 38 residues and it has a copy of its own and then the second protein and so on. Now I want to have the COM of each protein at each time frame and build it into a time series. In addition to proteins topology file also contains the DPPC particles as well. Could someone help me how to do this? Thanks!

To make sure the output trajectory is correct it looks something like this enter link description here

Looks like you are calculating the COM of residues in `compound='residues'`. The easiest way is to select each protein one by one, calculate the COM, and then finally concatenate the arrays. Note that unwrapping the PBC might be tricky with 8 moving protein over longer time. — mateuszb, Feb 28 '22 at 23:21

orbeckst · Accepted Answer · 2022-03-01T15:49:38.443

I would load the system from the TPR file to maintain the bond information. Then MDAnalysis can determine fragments (namely, your proteins). Then loop over the fragments to determine the COM time series:

import MDAnalysis as mda
import numpy as np

# files from https://doi.org/10.5281/zenodo.846428
TPR = "lp400.tpr"
XTC = "lp400.xtc"

# build reduced universe to match XTC
# (ignore warnings that no coordinates are found for the TPR)
u0 = mda.Universe(TPR)
u = mda.Merge(u0.select_atoms("not resname W and not resname WF and not resname ION"))
u.load_new(XTC)

# segments (exclude the last one, which is DPPC and not protein)
protein_segments = u.segments[:-1]

# build the fragments
# (a dictionary with the key as the protein name -- I am using an
# OrderedDict so that the order is the same as in the TPR)
from collections import OrderedDict
protein_fragments = OrderedDict((seg.segid[6:], seg.atoms.fragments) for seg in protein_segments)

# analyze trajectory (with a nice progress bar)
timeseries = []
for ts in mda.log.ProgressBar(u.trajectory):
    coms = []
    for name, proteins in protein_fragments.items():
        # loop over all the different proteins;
        # unwrap to get the true COM under PBC (double check!!)
        coms.extend([p.center_of_mass(unwrap=True) for p in proteins]) 
    timeseries.append(coms)
timeseries = np.array(timeseries)

Note

Double check that unwrap=True is doing the right thing (and that it is necessary — I didn't check if any proteins were split across periodic boundaries).
Unwrapping is slow; if you don't need it, it will run faster.
The resulting array is a 3d array with shape (N_timesteps, M_proteins, 3), namely (10001, 14, 3).

The content of protein_fragments is

OrderedDict([('EPHA', (<AtomGroup with 74 atoms>, <AtomGroup with 74 atoms>)),
           ('OMPA', (<AtomGroup with 307 atoms>, <AtomGroup with 307 atoms>)),
           ('OMPG', (<AtomGroup with 606 atoms>, <AtomGroup with 606 atoms>)),
           ('BTUB', (<AtomGroup with 1289 atoms>, <AtomGroup with 1289 atoms>)),
           ('ATPS', (<AtomGroup with 1918 atoms>, <AtomGroup with 1918 atoms>)),
           ('GLPF', (<AtomGroup with 2136 atoms>, <AtomGroup with 2136 atoms>)),
           ('FOCA', (<AtomGroup with 2685 atoms>, <AtomGroup with 2685 atoms>))])

Thanks for the answer! How to identify which part of the trajectory belongs to which protein? In addition to that the output does not match with the processed file I have. I provided a link of the processed file in the question. — Advaita, Mar 01 '22 at 11:24
I used the files from the Zenodo repo and instead of the GRO file, I used the TPR file which has more accurate information. If you insist on using the GRO file then you need to manually figure out where your proteins are (instead of using fragments). E.g., `protein_fragments['EPHA'] = (u.atoms[:74], u.atoms[74:2*74])` (I am just using the atom information above (which might be wrong) but you can use any kind of [selections](https://userguide.mdanalysis.org/stable/selections.html).) — orbeckst, Mar 01 '22 at 15:53
For more detailed discussions I suggest you ask on the [MDAnalysis DIscussion group](https://groups.google.com/group/mdnalysis-discussion) or on the [MDAnalysis Discord](https://www.mdanalysis.org/#participating) where all the MDAnalysis developers hang out — S/O comment threads are in my experience not the ideal forum for figuring out these problems. — orbeckst, Mar 01 '22 at 15:56
The atom information is correct! But the output time series in seems incorrect! Although, the z-axis seems reasonable! One question! Is it possible to convert this output into the .gro format, which looks like the .gro file I've attached at the end of the question? then it can be easily read. Does MDAnalysis have any functionality to do that? — Advaita, Mar 02 '22 at 14:57
What do you mean by "this output"? The timeseries of positions is just an array. You can write the first frame of the universe to a GRO file with `u.trajectory[0]; u.atoms.write("system.gro")`. — orbeckst, Mar 03 '22 at 16:22
How do you know that the coordinates are incorrect? As noted, if the molecules are already made whole in the trajectory (I didn't check) then you do not need the `unwrap=True` argument. You need to experiment there a little bit and possibly visualize the trajectory to double check. — orbeckst, Mar 03 '22 at 16:24

How to calculate center of mass of proteins using MDAnalysis?

1 Answers1

Linked