0

I'm working on trying to use ligands that are referenced in UniProt with the same ligand in PDB entries. For many ligands (e.g. FAD), the three-letter code is the same in both UniProt and PDB entries, but for some there is a slight difference. For example, for haemoglobin 1a9w chain A, in the PDB file I find "HEM" but in the corresponding UniProt entry (P69905) I find "heme b". "heme b" (in the UniProt json) has chebi id CHEBI:60344.

I downloaded the full ChEBI sdf file from https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/, and find there are three haems that are close to what I want. So far, so good.

If I use the following code to calculate Tanimoto coefficients using CHEBI:60344 as a reference, one of the haems is okay but the other raises a C++ exception that I haven't been able to catch in my Python code. The problem is that if my list of chebi ids is the other way round, the code always fails before I get a value for the Tanimoto coefficient.

My question is - is this a bug in my implementation of the RDKIT code, is it a bug in the RDKIT code, is it a bug in the ChEBI module of bioservices, is the SMILES string in the ChEBI sdf file written incorrectly, or is there another issue?

This is all using conda installed rdkit, bioservices, python3.9 etc on a (old) Mac Pro running High Sierra (can't upgrade to a newer OS).

Ran this code:

from rdkit import Chem, DataStructs
from bioservices import ChEBI
heme = ChEBI()
heme_chebi_id = "CHEBI:60344"
heme_smiles = heme.getCompleteEntity(heme_chebi_id).smiles
target = Chem.MolFromSmiles(heme_smiles)
fp2 = Chem.RDKFingerprint(target)
for chebi_id in ["CHEBI:17627", "CHEBI:26355"]:
    ch = ChEBI()
    smiley = ch.getCompleteEntity(chebi_id).smiles
    print("reference:", heme_chebi_id)
    print("target:   ", chebi_id)
    print("reference:", heme_smiles)
    print("target:   ", smiley)
    ref = Chem.MolFromSmiles(smiley)
    fp1 = Chem.RDKFingerprint(ref)
    Tan = DataStructs.TanimotoSimilarity(fp1, fp2)
    print(Tan)
    print("-" * 64)
exit()

got this output:

reference: CHEBI:60344
target:    CHEBI:17627
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2

target:    CC1=C(CCC(O)=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2
Tanimoto coefficient: 1.0

reference: CHEBI:60344
target:    CHEBI:26355
reference: CC1=C(CCC([O-])=O)C2=[N+]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N+]5[Fe--]3(n14)n1c(=C6)c(C)c(CCC([O-])=O)c1=C2
target:    CC1=C(CCC(O)=O)C2=[N]3C1=Cc1c(C)c(C=C)c4C=C5C(C)=C(C=C)C6=[N]5[Fe]3(n14)n1c(=C6)c(C)c(CCC(O)=O)c1=C2

[12:36:26] Explicit valence for atom # 9 N, 4, is greater than permitted
Traceback (most recent call last):
File "/Volumes/Users/harry/icl/phyre2-ligand/./tanimoto_test.py", line 20, in <module>
fp1 = Chem.RDKFingerprint(ref)
Boost.Python.ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1, unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2, bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True, bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0, boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None)
betelgeuse
  • 1,136
  • 3
  • 13
  • 25

2 Answers2

2

This error means that the input to the function Chem.RDKFingerprint is None. That means that ref is None. You can try printing the value of ref to verify.

In this case, this is None because RdKit is not able to parse the given SMILES to a proper mol object. It has even raised the following warning if you look at the error carefully:

Explicit valence for atom # 9 N, 4, is greater than permitted

This is because of the co-ordinate bond present in the molecule which RdKit doesn't support. RdKit will treat it as a single bond which will raise the valency of both the Nitrogen atoms to 4 and hence an invalid molecule. Here's the same molecule generated from other sources:

ref molecule

To deal with this error, you'll have to modify the SMILES manually to make it such that either there's a charge on those nitrogen atoms or [Fe] is a separate atom rather than connected with a bond. Something like this:

modified ref

This isn't really an issue with the SMILES but more of a limitation with RDKit for its inability to support co-ordinate bonds. I have faced this issue many times and always had to modify the SMILES manually to get around it. One suggestion for you is that you can programmatically modify the SMILES because this kind of error will most likely occur for Metal-Ligand catalysts where a co-ordinate bond is almost always there. So you can search for atoms like [Fe] or [Pt] in the SMILES string and then modify them.

betelgeuse
  • 1,136
  • 3
  • 13
  • 25
  • So in other words it's actually an issue with the "ChEBI_complete.sdf.gz" file that I downloaded from the EBI (or perhaps the individual sdf file downloaded with the bioservices ChEBI getCompleteEntity method, which could possibly be different). I'm afraid that I have to take the SMILES strings as I get them from major public databases, so editing each one that is aberrant isn't really feasible (though I'll certainly tell the folks at the EBI that there is a problem with this one). – Ubuntu-on-Mac Jan 25 '23 at 15:13
  • 1
    Check my updated answer. I added some comments that might be helpful. – betelgeuse Jan 25 '23 at 23:04
  • Ah, okay. That is very helpful, thank you. I've been in touch with the EBI about this and their answer is (to paraphrase) that their SMILES strings are IUPAC compliant and that the issue is with RDKit (as you suggest). So now I'm waiting on a response from the RDKit maintainers to see what they have to say. – Ubuntu-on-Mac Jan 26 '23 at 09:26
  • While waiting for an answer from RDKit, I noticed that there's a flag for MolFromSmiles "sanitize=False". If I use this in my code, I get a mol, and that doesn't cause the TanimotoSimilarity to barf - which is really all I want, because all I'm doing is checking whether the ligand in UniProt is similar to the ligand in PDB when they've been given different names. For the time being, this will allow me to proceed while ignoring chemical rigour. – Ubuntu-on-Mac Jan 26 '23 at 10:15
  • The sanitize option was good provided I only wanted Tanimoto coefficients. If I want FCFP4 values (to provide an alternative measure of similarity, using "GetMorganFingerprint)), that fails if I use "sanitize". The answer to that was to trap the returned None and get a SMILES from pubchem, which is different enough that I can proceed. This SMILES may not be IUPAC compliant (I haven't checked) but it gets me past this issue. – Ubuntu-on-Mac Jan 30 '23 at 09:06
  • Further to my comment about "waiting for an answer from RDKit", they actually replied very quickly and are investigating how to address this in their code. – Ubuntu-on-Mac Jan 30 '23 at 09:13
0

I've managed to get a couple of workarounds for this.

The problem arises because RDKit is (as of 30 Jan 2023) unable to process some IUPAC compliant SMILES (as noted in betelgeuse's answers).

One thing to do is to use the "Sanitize=False" option for rdkit.Chem.MolFromSmiles - this allows a non-None value to be returned for this SMILES, and subsequently, rdkit.Chem.RDKFingerprint returns a useful value.

However, using the results of the "Sanitize=False" option fails if I want to explore an alternative measure of similarity, e.g. FCFP4 instead of Tanimoto, using "rdkit.Chem.rdMolDescriptors.GetMorganFingerprint"; the way I got round this was to test for "None" from MolFromSmiles without using sanitize=False, retrieve an alternative SMILES from PubChem and use that. Having said that, if I didn't really want the SMILES from PDBeChem, I could have done that in the first place...