Generate 2d images of molecules from PubChem FTP data

Question

Rather than crawl PubChem's website, I'd prefer to be nice and generate the images locally from the data on the PubChem FTP site:

ftp://ftp.ncbi.nih.gov/pubchem/specifications/

The only problem is that I'm limited to OS X and Linux, and I can't seem to find a way to programmatically generate the 2D images that they show on their site. See this example:

https://pubchem.ncbi.nlm.nih.gov/compound/6#section=Top

Under the heading "2D Structure" we have this image here:

https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=6&t=l

That is what I'm trying to generate.

https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html#_Toc409516689 — zero323, Sep 17 '15 at 14:49
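For a small number of compounds, the PUG REST service that the comment above links to can return a PNG depiction for a CID, without scraping the compound pages. A minimal sketch of building such a request URL follows. The helper name `pug_image_url` and its default are illustrative only, and the accepted `image_size` values should be checked against the current PUG REST documentation.

```python
from urllib.parse import urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pug_image_url(cid, image_size="large"):
    """Build a PUG REST URL for the PNG depiction of one compound.

    Hypothetical helper: it only formats the URL and performs no request.
    """
    query = urlencode({"image_size": image_size})
    return f"{PUG_REST}/compound/cid/{int(cid)}/PNG?{query}"
```

These requests still go to PubChem's servers, so this suits occasional lookups rather than bulk generation.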

David Hoksza · Answer 1 · 2016-08-05T07:37:40.360

If you want something working out of the box I would suggest using molconvert from ChemAxon's Marvin (https://www.chemaxon.com/products/marvin/), which is free for academics. It can be used easily from the command line and it supports plenty of input and output formats. So for your example it would be:

molconvert "png" -s "C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl" -o cdnb.png

This writes the rendered structure to cdnb.png.

It also allows you to set parameters such as width, height, quality, background color and so on.

However, if you are a programmer, I would definitely recommend RDKit. The following code generates images for a pair of compounds given as SMILES.

from rdkit import Chem
from rdkit.Chem import Draw

# (SMILES, output name) pairs
ms_smis = [("C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl", "cdnb"),
           ("C1=CC(=CC(=C1)N)C(=O)N", "3aminobenzamide")]
ms = [(Chem.MolFromSmiles(smi), name) for smi, name in ms_smis]

# Write one 800 x 800 SVG per compound
for mol, name in ms:
    Draw.MolToFile(mol, name + ".svg", size=(800, 800))

This writes the images to cdnb.svg and 3aminobenzamide.svg.
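The RDKit code above assumes that every SMILES string parses. With bulk input, a more defensive loop may help, because `Chem.MolFromSmiles` returns `None` for a string it cannot parse. The following is a minimal sketch, assuming an RDKit build with PNG output support. The function name `render_smiles` and the 300 x 300 default size are illustrative choices, not part of the original answer.

```python
from pathlib import Path

from rdkit import Chem
from rdkit.Chem import Draw

def render_smiles(records, out_dir, size=(300, 300)):
    """Render (name, smiles) pairs to PNG files.

    Returns the names whose SMILES could not be parsed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for name, smiles in records:
        mol = Chem.MolFromSmiles(smiles)  # None when parsing fails
        if mol is None:
            failed.append(name)
            continue
        Draw.MolToFile(mol, str(out_dir / f"{name}.png"), size=size)
    return failed
```

Calling `render_smiles([("cdnb", "C1=CC(=C(C=C1[N+](=O)[O-])[N+](=O)[O-])Cl")], "2D")` would write `2D/cdnb.png` and return an empty list.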

Since ChemAxon's Marvin is a commercial product, mentioning that would be nice, especially if you're affiliated with the vendor somehow. Otherwise, please add a reference to `RDKit` and provide a short example of how to use it to solve the question. — jlapoutre, Aug 04 '16 at 20:42
JChem has an academic licence giving access to all of their tools. I added this information to the answer, together with a code example in RDKit. — David Hoksza, Aug 05 '16 at 07:39

zachaysan · Answer 2 · 2015-09-17T21:03:46.827

So I also emailed the PubChem guys and they got back to me very quickly with this response:

The only bulk access we have to images is through the download service: https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi
You can request up to 50,000 images at a time.

Which is better than I was expecting, but still not amazing, since it requires downloading things that I could in theory generate locally. So I'm leaving this question open until some kind soul writes an open source library to do the same.

Edit:

I figure I might as well save people some time if they are doing the same thing as I am. I've created a Ruby gem built on Mechanize to automate downloading the images. Please be kind to their servers and only download what you need.

https://github.com/zachaysan/pubchem

gem install pubchem

score 0 · Answer 3 · answered May 03 '20 at 02:05

An open source option is the Indigo Toolkit, which has pre-compiled packages for Linux, Windows, and macOS, with bindings for Python, Java, and .NET as well as C libraries. I used the 1.4.0 beta.

I had a similar interest in converting SMILES to 2D structures, so I adapted my Python script to address your question and to capture timing information. It uses the CID-SMILES.gz file from the PubChem FTP site (Compound/Extras), which lists isomeric SMILES for over 102 million compound records. The script below is a local SMILES-to-2D-structure converter: it reads a range of rows from the CID-SMILES file and renders each SMILES string as a PNG image of the 2D structure.

In three tests of 1000 conversions each, at file row offsets of 0, 100,000, and 10,000,000, the runs took 35, 50, and 60 seconds respectively on my Windows 10 laptop (Intel i7-7500U CPU, 2.70 GHz, solid-state drive) running Python 3.7.4. The 3000 PNG files totaled 100 MB.

from indigo import *
from indigo.renderer import *
import subprocess
import datetime

def timerstart():
    # start timer and print time, return start time
    start = datetime.datetime.now()
    print("Start time =", start)

    return start

def timerstop(start):
    # end timer and print time and elapsed time, return elapsed time
    endtime = datetime.datetime.now()
    elapsed = endtime - start
    print("End time =", endtime)
    print("Elapsed time =", elapsed)

    return elapsed

numrecs = 1000
recoffset = 0 # 10000000    # record offset
starttime = timerstart()

indigo = Indigo()
renderer = IndigoRenderer(indigo)

# set render options
indigo.setOption("render-atom-color-property", "color")
indigo.setOption("render-coloring", True)
indigo.setOption("render-comment-position", "bottom")
indigo.setOption("render-comment-offset", "20")
indigo.setOption("render-background-color", 1.0, 1.0, 1.0)
indigo.setOption("render-output-format", "png")

# set data path (including data file) and output file path
datapath = r'../Download/CID-SMILES'
pngpath = r'./2D/'

# read subset of rows from data file
mycmd = "head -" + str(recoffset+numrecs) + " " + datapath + " | tail -" + str(numrecs) 
print(mycmd)
(out, err) = subprocess.Popen(mycmd, stdout=subprocess.PIPE, shell=True).communicate()

lines = out.decode("utf-8").splitlines()
count = 0
for line in lines:
    if not line.strip():
        continue                  # skip blank lines
    try:
        cols = line.split("\t")   # split on tab
        key = cols[0]             # cid in cols[0]
        smiles = cols[1]          # smiles in cols[1]
        mol = indigo.loadMolecule(smiles)
        s = "CID=" + key
        indigo.setOption("render-comment", s)
        #indigo.setOption("render-image-size", 200, 250)
        #indigo.setOption("render-image-size", 400, 500)
        renderer.renderToFile(mol, pngpath + key + ".png")
        count += 1
    except Exception as exc:
        # malformed row, unparsable SMILES, or rendering failure
        print("Error processing line after", count, ":", line, "-", exc)

elapsedtime = timerstop(starttime)
print("Converted", str(count), "SMILES to PNG")
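The script selects its rows by shelling out to `head` and `tail`, which are not available on a stock Windows installation and require the file to be decompressed first. A portable alternative is to read the row range directly in Python. The following is a minimal sketch using only the standard library; the function name `read_cid_smiles` is hypothetical, and it assumes the tab-separated CID and SMILES layout that the script above parses.

```python
import gzip
from itertools import islice

def read_cid_smiles(path, offset=0, count=1000):
    """Yield (cid, smiles) pairs for `count` rows starting at row `offset`.

    Reads plain or gzip-compressed files, chosen by the file extension.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in islice(handle, offset, offset + count):
            cid, _, smiles = line.rstrip("\r\n").partition("\t")
            if cid and smiles:  # skip blank or malformed rows
                yield cid, smiles
```

The rendering loop could then iterate over `read_cid_smiles(datapath, recoffset, numrecs)` instead of splitting the output of the shell command. Skipping to a large offset still reads every earlier row, so later ranges take longer to reach.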
