I'm working on trying to use ligands that are referenced in UniProt with the same ligand in PDB entries. For many ligands (e.g. FAD), the three-letter code is the same in both UniProt and PDB entries, but for some there is a slight difference. For example, for haemoglobin 1a9w chain A, in the PDB file I find "HEM" but in the corresponding UniProt entry (P69905) I find "heme b". "heme b" (in the UniProt json) has chebi id CHEBI:60344.
I downloaded the full ChEBI sdf file from https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/, and find there are three haems that are close to what I want. So far, so good.
If I use the following code to calculate Tanimoto coefficients using CHEBI:60344 as a reference, one of the haems is okay but the other raises a C++ exception that I haven't been able to catch in my Python code. The problem is that if my list of chebi ids is the other way round, the code always fails before I get a value for the Tanimoto coefficient.
My question is - is this a bug in my implementation of the RDKIT code, is it a bug in the RDKIT code, is it a bug in the ChEBI module of bioservices, is the SMILES string in the ChEBI sdf file written incorrectly, or is there another issue?
This is all using conda installed rdkit, bioservices, python3.9 etc on a (old) Mac Pro running High Sierra (can't upgrade to a newer OS).
Ran this code:
from rdkit import Chem, DataStructs
from bioservices import ChEBI
heme = ChEBI()
heme_chebi_id = "CHEBI:60344"
heme_smiles = heme.getCompleteEntity(heme_chebi_id).smiles
target = Chem.MolFromSmiles(heme_smiles)
fp2 = Chem.RDKFingerprint(target)
for chebi_id in ["CHEBI:17627", "CHEBI:26355"]:
ch = ChEBI()
smiley = ch.getCompleteEntity(chebi_id).smiles
print("reference:", heme_chebi_id)
print("target: ", chebi_id)
print("reference:", heme_smiles)
print("target: ", smiley)
ref = Chem.MolFromSmiles(smiley)
fp1 = Chem.RDKFingerprint(ref)
Tan = DataStructs.TanimotoSimilarity(fp1, fp2)
print(Tan)
print("-" * 64)
exit()
got this output:
reference: CHEBI:60344
target: CHEBI:17627
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2
target: CC1=C(CCC(O)=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2
Tanimoto coefficient: 1.0
reference: CHEBI:60344
target: CHEBI:26355
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2
target: CC1=C(CCC(O)=O)C2=[N]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N]5[Fe]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2
[12:36:26] Explicit valence for atom # 9 N, 4, is greater than permitted
Traceback (most recent call last):
File "/Volumes/Users/harry/icl/phyre2-ligand/./tanimoto_test.py", line 20, in <module>
fp1 = Chem.RDKFingerprint(ref)
Boost.Python.ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1, unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2, bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True, bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0, boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None)