To find duplicate chemical structures, I wrote a small Python script using RDKit (Win10, Python 3.11.3, without (Ana)conda).
The script seems to work. However, since I want to pass it on to someone who has no Python installation, I'm using
pyinstaller --onefile CheckSMILESduplicates.py
to create an executable.
The resulting executable is larger than 100 MB and takes about 20 seconds to start. Is this really the lower limit for the file size?
Question: How to reduce the size of a PyInstaller RDKit executable?
There are already some highly active questions on this topic:
- Reducing size of pyinstaller exe
- Reduce pyinstaller executable size
- How to reduce the size of a python exe file? (duplicate)
- Reducing the size of executable created by pyinstaller (without answers so far)
The suggestions there have not helped me so far, and maybe there are RDKit-specific "tricks" and "excludes"...
- exclude all the unnecessary packages (How should I know what is not required? Trial and error?)
- use UPX (the build already uses upx=True and still ends up at >100 MB)
- create a "clean" environment with only the packages you are using (Why is this not possible with an existing "dirty" environment/installation?)
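One way to narrow down the "what is not required?" question without pure trial and error is to look at which top-level packages are actually loaded once the script's imports have run. The following is only a sketch under assumptions: the helper names `top_level_packages` and `exclude_flags` and the candidate list are hypothetical, and modules that are imported lazily at a later point would be missed, so the resulting executable still has to be tested. `--exclude-module` is the PyInstaller option used for the excludes.

```python
import sys


def top_level_packages(module_names):
    """Reduce dotted module names to a sorted list of top-level package names."""
    return sorted({name.split(".")[0] for name in module_names if name})


def exclude_flags(candidates, loaded):
    """Build --exclude-module flags for candidates that were never loaded."""
    loaded = set(loaded)
    flags = []
    for name in candidates:
        if name not in loaded:
            flags += ["--exclude-module", name]
    return flags


if __name__ == "__main__":
    # Import what the script imports, then inspect sys.modules.
    try:
        from rdkit import Chem  # noqa: F401
    except ImportError:
        print("rdkit is not installed here; showing interpreter modules only")
    loaded = top_level_packages(sys.modules)

    # Hypothetical candidates: large packages often found in a "dirty" environment.
    candidates = ["matplotlib", "pandas", "scipy", "tkinter", "IPython"]
    command = (["pyinstaller", "--onefile"]
               + exclude_flags(candidates, loaded)
               + ["CheckSMILESduplicates.py"])
    print(" ".join(command))
```

The printed command is a starting point, not a verified minimal build. Separately, `--onefile` builds unpack themselves into a temporary folder on each start, which likely contributes to the start-up delay; a `--onedir` build avoids that step.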
Thank you for any RDKit-specific advice on how to reduce the size of the executable.
And for those who are interested in the script (or want to test it):
Input: (minimized/simplified example, e.g. from ChemDraw)
- in ChemDraw select all structures and copy as SMILES
- paste and save into SMILES.txt, which gives one long string in which the SMILES are separated by "."
C12=CC=CC=C1C=C3C(C=CC=C3)=C2.C45=CC=CC=C4C=CC=C5.C67=CC=CC=C6C=CC8=C7C=CC=C8.C9%10=CC=CC=C9C%11=C(C=CC=C%11)C=C%10.C%12(C=C(C=CC=C%13)C%13=C%14)=C%14C=CC=C%12.C%15(C=C(C=CC=C%16)C%16=C%17)=C%17C=CC=C%15.C%18%19=CC=CC=C%18C=C%20C(C=C(C=CC=C%21)C%21=C%20)=C%19.c%22%23ccccc%22=CC=CC=%23.c%24%25ccccc%24cccc%25
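As a minimal sketch of how such a string can be taken apart (the function name `split_smiles` is hypothetical and the short SMILES below are just illustrative input), splitting on "." is enough for this kind of input. Note that "." in SMILES also separates disconnected parts of a single structure, so salts and similar multi-component structures would be split as well.

```python
def split_smiles(text):
    """Split a dot-separated SMILES string into individual, non-empty entries."""
    return [part for part in text.strip().split(".") if part]


example = "c1ccccc1.CCO.c1ccccc1"
print(split_smiles(example))
```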
Script:
# Check duplicate SMILES structures
#
# in a ChemDraw file:
# 1. select all structures (Ctrl+A)
# 2. copy as SMILES (Alt+Ctrl+C)
# 3. paste text into text file, e.g. SMILES.txt
# 4. start this program with command line argument "SMILES.txt"
# 5. SMILES will be separated and written into separate lines
# 6. Output: a) total structures found, b) different sum formulas, c) duplicates

from rdkit import Chem
import sys

def read_file(ffname):
    print('Loading "{}" ...'.format(ffname))
    with open(ffname, "r") as f:
        smiles = f.read().strip().split(".")
    # the input file is overwritten with one SMILES per line (step 5)
    with open(ffname, "w") as f:
        smiles_str = '\n'.join(smiles)
        f.write(smiles_str)
    return smiles_str.split()

def smiless_to_inchis(smiless):
    mols = [Chem.MolFromSmiles(x) for x in smiless]
    inchis = [Chem.inchi.MolToInchi(x) for x in mols]
    return inchis

def get_inchi_duplicates(inchis):
    print("\nTotal structures found: {}".format(len(inchis)))
    print("\n".join(inchis))
    dict_sumform = {}
    dict_inchi = {}
    for inchi in inchis:
        # the sum formula is the second "/"-separated part of the InChI
        sumform = inchi.split('/')[1]
        if sumform in dict_sumform.keys():
            dict_sumform[sumform] += 1
        else:
            dict_sumform[sumform] = 1
        if inchi in dict_inchi.keys():
            dict_inchi[inchi] += 1
        else:
            dict_inchi[inchi] = 1
    print("\nDifferent sum formulas: {}".format(len(dict_sumform.keys())))
    for x in sorted(dict_sumform.keys()):
        print("{} : {}".format(x, dict_sumform[x]))
    # duplicates
    print("\nDuplicates:")
    for inchi in sorted(dict_inchi.keys()):
        if dict_inchi[inchi] > 1:
            print("{}x : {}".format(dict_inchi[inchi], inchi))

if __name__ == '__main__':
    print("Checking for duplicate structures...")
    if len(sys.argv) > 1:
        ffname = sys.argv[1]
        smiless = read_file(ffname)
        inchis = smiless_to_inchis(smiless)
        get_inchi_duplicates(inchis)
    else:
        print("Please give an input file!")
### end of script
- start the script on the console via
py CheckSMILESduplicates.py SMILES.txt
Output:
Loading "SMILES.txt" ...
Total structures found: 9
InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H
InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H
InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H
InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
InChI=1S/C18H12/c1-2-6-14-10-18-12-16-8-4-3-7-15(16)11-17(18)9-13(14)5-1/h1-12H
InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H
InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H
Different sum formulas: 3
C10H8 : 3
C14H10 : 5
C18H12 : 1
Duplicates:
3x : InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H
3x : InChI=1S/C14H10/c1-2-6-12-10-14-8-4-3-7-13(14)9-11(12)5-1/h1-10H
2x : InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H
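The counting step that produces the output above could also be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the script as posted; the function name `count_duplicates` is hypothetical, and the input is a shortened list of InChI strings taken from the output shown, so it runs without RDKit.

```python
from collections import Counter


def count_duplicates(inchis):
    """Return (formula_counts, duplicate_counts) for a list of InChI strings."""
    # The sum formula is the second "/"-separated part of the InChI string.
    formulas = Counter(inchi.split("/")[1] for inchi in inchis)
    # Keep only the InChI strings that occur more than once.
    duplicates = {key: n for key, n in Counter(inchis).items() if n > 1}
    return formulas, duplicates


inchis = [
    "InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H",
    "InChI=1S/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H",
    "InChI=1S/C18H12/c1-2-6-14-10-18-12-16-8-4-3-7-15(16)11-17(18)9-13(14)5-1/h1-12H",
]
formulas, duplicates = count_duplicates(inchis)
print(dict(formulas))
print(duplicates)
```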