
To find duplicate chemical structures, I wrote a small Python script using RDKit (Win10, Python 3.11.3, without (Ana)Conda). The script seems to work. However, since I want to pass it to someone who has no Python installation, I'm using pyinstaller --onefile CheckSMILESduplicates.py to create an executable.

This executable is more than 100 MB in size, and starting the .exe takes about 20 seconds. Is this really the lower limit for the file size?
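One way to see where the size comes from is to build in one-folder mode (--onedir instead of --onefile) and list the largest files in the output folder. The following is only a sketch under assumptions: dist/CheckSMILESduplicates is an assumed output path, and largest_files is a hypothetical helper, not part of PyInstaller.

```python
from pathlib import Path


def largest_files(root, n=10):
    """Return the n largest files below root as (size_in_bytes, relative_path) tuples."""
    root = Path(root)
    sizes = [
        (p.stat().st_size, p.relative_to(root).as_posix())
        for p in root.rglob("*")
        if p.is_file()
    ]
    # Largest first; ties are ordered by path so the result is deterministic.
    return sorted(sizes, key=lambda item: (-item[0], item[1]))[:n]


if __name__ == "__main__":
    # Assumed location of a one-folder build; adjust to the actual dist path.
    build_dir = Path("dist") / "CheckSMILESduplicates"
    if build_dir.is_dir():
        for size, name in largest_files(build_dir):
            print("{:8.1f} MB  {}".format(size / 1e6, name))
    else:
        print('Folder "{}" not found'.format(build_dir))
```

The files at the top of this list would be the first candidates to check when deciding what to exclude.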

Question: How to reduce the size of a PyInstaller RDKit executable?

There seem to be some highly active questions:

The suggestions from those questions, listed below with my comments, have not helped me so far, and maybe there are RDKit-specific "tricks" and "excludes":

  • exclude all the unnecessary packages (How should I know what is not required? Trial and error?)
  • use UPX (the >100 MB result is already with upx=True)
  • create a "clean" environment containing only the packages you are using (Why is this not possible with an existing "dirty" environment/installation?)
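For the first suggestion, one starting point other than pure trial and error could be to record which top-level packages Python actually loads when the script's imports run, and to treat other packages in the environment as candidates for PyInstaller's --exclude-module option. This is a minimal sketch, not a verified recipe: top_level_modules_loaded_by and build_pyinstaller_args are hypothetical helpers, the candidate list is only an example, and whether RDKit still works without a given package has to be checked with the built executable.

```python
import importlib
import sys


def top_level_modules_loaded_by(module_name):
    """Import module_name and return the top-level packages this import newly loaded."""
    before = set(sys.modules)
    importlib.import_module(module_name)
    # Only modules that were not loaded before are reported, so run this
    # in a fresh interpreter to get the full picture.
    return {name.split(".")[0] for name in set(sys.modules) - before}


def build_pyinstaller_args(script, excludes):
    """Build a PyInstaller argument list with one --exclude-module per package."""
    args = ["--onefile", script]
    for name in sorted(excludes):
        args += ["--exclude-module", name]
    return args


if __name__ == "__main__":
    # In the real environment one would pass "rdkit.Chem" here;
    # "json" is used only so that the sketch runs without RDKit installed.
    print(sorted(top_level_modules_loaded_by("json")))
    # Example candidates only (assumption): verify each one by testing the executable.
    candidates = {"tkinter", "matplotlib", "IPython"}
    args = build_pyinstaller_args("CheckSMILESduplicates.py", candidates)
    print("pyinstaller " + " ".join(args))
```

The printed command line can then be run on the console; the list of loaded packages is only a hint, since a package may be imported lazily later on.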

Thank you for any RDKit-specific advice on how to reduce the size of the executable.

And for those who are interested in the script (or want to test it):

Input: (minimized/simplified example, e.g. from ChemDraw)


  • in ChemDraw select all structures and copy as SMILES
  • paste and save into SMILES.txt, which gives one long string in which the SMILES are separated by a dot (.)
C12=CC=CC=C1C=C3C(C=CC=C3)=C2.C45=CC=CC=C4C=CC=C5.C67=CC=CC=C6C=CC8=C7C=CC=C8.C9%10=CC=CC=C9C%11=C(C=CC=C%11)C=C%10.C%12(C=C(C=CC=C%13)C%13=C%14)=C%14C=CC=C%12.C%15(C=C(C=CC=C%16)C%16=C%17)=C%17C=CC=C%15.C%18%19=CC=CC=C%18C=C%20C(C=C(C=CC=C%21)C%21=C%20)=C%19.c%22%23ccccc%22=CC=CC=%23.c%24%25ccccc%24cccc%25

Script:

# Check duplicate SMILES structures
#
# in a ChemDraw file:
# 1. select all structures (Ctrl+A)
# 2. copy as SMILES (Alt+Ctrl+C)
# 3. paste text into text file, e.g. SMILES.txt
# 4. start this program with command line argument "SMILES.txt"
# 5. SMILES will be separated and written into separate lines
# 6. Output: a) total structures found, b) different sum formulas, c) duplicates

from rdkit import Chem
import sys

def read_file(ffname):
    print('Loading "{}" ...'.format(ffname))
    with open(ffname, "r") as f:
        smiles = f.read().strip().split(".")
    with open(ffname, "w") as f:
        smiles_str = '\n'.join(smiles)
        f.write(smiles_str)
    return smiles_str.split()

def smiless_to_inchis(smiless):
    # MolFromSmiles returns None for SMILES it cannot parse; those are skipped
    mols   = [Chem.MolFromSmiles(x) for x in smiless]
    inchis = [Chem.inchi.MolToInchi(x) for x in mols if x is not None]
    return inchis

def get_inchi_duplicates(inchis):
    print("\nTotal structures found: {}".format(len(inchis)))
    print("\n".join(inchis))
    dict_sumform = {}
    dict_inchi   = {}
    for inchi in inchis:
        sumform = inchi.split('/')[1]
        if sumform in dict_sumform.keys():
            dict_sumform[sumform] += 1
        else:
            dict_sumform[sumform]  = 1
        if inchi in dict_inchi.keys():
            dict_inchi[inchi] += 1
        else:
            dict_inchi[inchi]  = 1
    print("\nDifferent sum formulas: {}".format(len(dict_sumform.keys())))
    for x in sorted(dict_sumform.keys()):
        print("{} : {}".format(x, dict_sumform[x]))
    # duplicates
    print("\nDuplicates:")
    for inchi in sorted(dict_inchi.keys()):
        if dict_inchi[inchi]>1:
            print("{}x : {}".format(dict_inchi[inchi], inchi))

if __name__ == '__main__':
    print("Checking for duplicate structures...")
    if len(sys.argv)>1:
        ffname = sys.argv[1]
        smiless = read_file(ffname)
        inchis = smiless_to_inchis(smiless)
        get_inchi_duplicates(inchis)
    else:
        print("Please give an input file!")
### end of script
  • start the script on the console via py CheckSMILESduplicates.py SMILES.txt

Output:

Loading "SMILES.txt" ...

Total structures found: 9
InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H
InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H
InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H
InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
InChI=1S/C18H12/c1-2-6-14-10-18-12-16-8-4-3-7-15(16)11-17(18)9-13(14)5-1/h1-12H
InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H
InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H

Different sum formulas: 3
C10H8 : 3
C14H10 : 5
C18H12 : 1

Duplicates:
3x : InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H
3x : InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
2x : InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H
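The counting step that produces the output above could also be written with collections.Counter. This is a sketch of an alternative, not the script's own code: count_formulas_and_duplicates is a hypothetical name, and it works on InChI strings only, so it runs without RDKit.

```python
from collections import Counter


def count_formulas_and_duplicates(inchis):
    """Return (sum formula counts, counts of InChIs that occur more than once)."""
    # As in the script, the sum formula is the second "/"-separated part of the InChI.
    formulas = Counter(inchi.split("/")[1] for inchi in inchis)
    occurrences = Counter(inchis)
    duplicates = {inchi: n for inchi, n in occurrences.items() if n > 1}
    return formulas, duplicates


if __name__ == "__main__":
    # Two InChI strings taken from the output above.
    a = "InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"
    b = "InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H"
    formulas, duplicates = count_formulas_and_duplicates([a, b, a])
    for formula in sorted(formulas):
        print("{} : {}".format(formula, formulas[formula]))
    for inchi in sorted(duplicates):
        print("{}x : {}".format(duplicates[inchi], inchi))
```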
theozh
