How to superimpose and calculate the RMSD for a large number of mmCIF(or PDB) files containing DIFFERENT proteins

Question

I would like to ask you all a question about superimposing and calculating the RMSD of multiple mmCIF files at once. I am creating a code that downloads a entire homologous superfamily, which then need to be trimmed down based on a specific RMSD value. I want to automate this process in python (within jupyterLab).

The mmCIF files in question contain different proteins. For now I have tried to use BIO.PDB (MMCIFPParser) to first parse the structure from the first .cif file (called mmcif_ref), and then a list of all other files. I want to compare all other protein structures with the reference and calculate a RMSD. However, the problem is that they don't have the same atoms, which I found on the internet, is one of the main criteria.

My current code doesn't work and gives an error:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 rmsd = calculate_rmsd(mmcif_list, mmcif_ref, mmcif_comp)

Cell In[21], line 7, in calculate_rmsd(mmcif_dir, mmcif_ref, mmcif_comp)
      4 parser = MMCIFParser()
      6 # Parse the structures from the MMCIF files
----> 7 structure1 = parser.get_structure("reference", mmcif_dir + '/' + mmcif_ref)
      8 structure2 = parser.get_structure("comparison", mmcif_dir + '/' + mmcif_comp)
     10 # Select the atoms for superimposition

TypeError: can only concatenate list (not "str") to list

So my question is, seeing my code, what would you advice me to change in order to be able to superimpose multiple different proteins on one reference protein and save only the cif files that meet a specific rmsd value.

I hope someone can help me. Thanks in advance!

# Initialization 
cur_dir = os.getcwd()
mmcif_dir = cur_dir + '/' + protein_name + '/input/cif_files'
output_dir = cur_dir + '/' + protein_name + '/prep'
mmcif_list = []
for file in os.listdir(mmcif_dir):
    if file.endswith('.cif'):
        mmcif_list.append(file)

mmcif_ref = mmcif_list[0]
mmcif_comp = mmcif_list[1:]
print(mmcif_ref)
print(mmcif_comp)

def calculate_rmsd(mmcif_dir, mmcif_ref, mmcif_comp):
    parser = MMCIFParser()

    # Parse the structures from the MMCIF files
    structure1 = parser.get_structure("reference", mmcif_dir + '/' + mmcif_ref)
    structure2 = parser.get_structure("comparison", mmcif_dir + '/' + mmcif_comp)

    # Select the atoms for superimposition
    atoms1 = Selection.unfold_entities(structure1, "N, CA, C")
    atoms2 = Selection.unfold_entities(structure2, "N, CA, C")

    # Create an instance of the Superimposer
    super_imposer = Superimposer()

    # Set the atoms for superimposition
    super_imposer.set_atoms(atoms1, atoms2)

    # Apply the transformation to the atoms of structure2
    super_imposer.apply(structure2.get_atoms())

    # Calculate the RMSD
    rmsd = super_imposer.rms

    return rmsd

rmsd = calculate_rmsd(mmcif_list, mmcif_ref, mmcif_comp)

Welcome to [so]. We don't provide debugging services. The following references give advice on debugging your code. [How to debug small programs](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/), [Six Debugging Techniques for Python Programmers](https://medium.com/techtofreedom/six-debugging-techniques-for-python-programmers-cb25a4baaf4b) or [Ultimate Guide to Python Debugging](https://towardsdatascience.com/ultimate-guide-to-python-debugging-854dea731e1b) — itprorh66, Jun 20 '23 at 17:51
see how pymol does it: https://pymolwiki.org/index.php/Align align performs a sequence alignment followed by a structural superposition, and then carries out zero or more cycles of refinement in order to reject structural outliers found during the fit. align does a good job on proteins with decent sequence similarity (identity >30%) — pippo1980, Jun 23 '23 at 15:06
see here, try it just google it not tested : https://bougui505.github.io/2017/09/11/align_two_pdb_files_using_biopython.html — pippo1980, Jun 24 '23 at 17:05
since you are talking about "a entire homologous superfamily" probably the best way to go would be to align the central if any) conserved core of the proteins to te reference and use it to calculate the superposition matrix [the atom indexes of thecore for both the referece and each member of the family; super_imposer.set_atoms(atoms1, atoms2)] and then apply it to the whole protein [super_imposer.apply(structure2.get_atoms())]. Dont know if Interpro Pfam has them precompiled — pippo1980, Jun 25 '23 at 15:17
see https://bioinformatics.stackexchange.com/questions/14919/assessing-pymol-sequence-alignment-object/14936#14936 and https://bioinformatics.stackexchange.com/questions/19105/perform-protein-structure-based-sequence-alignment-in-python and https://pymolwiki.org/index.php/Cealign , cealign aligns the structures and returns a clustalw alignment object — pippo1980, Jun 26 '23 at 10:05
using from pymol cealign aligns two proteins using the CE algorithm. It is very robust for proteins with little to no sequence similarity (twilight zone). For proteins with decent structural similarity, the super command is preferred and with decent sequence similarity, the align command is preferred, because these commands are much faster than cealign. — pippo1980, Jun 26 '23 at 10:45
Biopython has cealign module https://biopython.org/docs/dev/api/Bio.PDB.cealign.html — pippo1980, Jun 26 '23 at 11:32
Thank you for your replies. I have been trying to use the pymol module, however, anaconda jupyter Lab doesn't want to install it. It continuously fails when trying to solve the environment wit repodata. Now I am trying to reinstall anaconda. However, it is still failing to solve the environment. I have tried to find answers online however, I have not been successful as of yet — y.a, Jun 27 '23 at 09:17

How to superimpose and calculate the RMSD for a large number of mmCIF(or PDB) files containing DIFFERENT proteins

0 Answers0