2

Update: The link in the answer is both interesting and useful, but unfortunately does not address the need for a java API, so I am still looking forward to any input.

I'm building a database of chemical compounds. I need all the synonyms (IUPAC and common names) as well as safety data for each.
I'll be using the freely available data at PubChem (http://pubchem.ncbi.nlm.nih.gov/)

There's an easy way of querying each compound with simple HTTP gets. For example, to obtain glycerol data, the URL is:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=753

And the following URL would return an easy to parse format:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=753&disopt=DisplaySDF

but it will respond only very basic info, lacking safety data and only a few common names.

There is one public domain API for JAVA that seems a very complete, developed by a group at Scripps (citation). The code is here.

Unfortunately, this API is not very well documented and it's quite difficult to follow due to the complexity of the data involved. For what I gathered, pubchemdb is using the PubChem Power User Gateway (PUG) XML API

Has anyone used this API (or any other one available)? I would appreciate a short description or tutorial on how to start with it.

Aleadam
  • 40,203
  • 9
  • 86
  • 108
  • This is probably a bit specialist for StackOverflow. Are there any cheminformatics communities you could try? – Tom Anderson May 09 '11 at 20:21
  • @Tom it might be quite specialized, but I'm hoping somebody here worked with these databases. There are quite a few ncbi questions answered here. I may contact the authors directly also. – Aleadam May 09 '11 at 23:13

1 Answers1

2

The Cactvs Chemoinformatics toolkit (free for academic/educational use) has full PubChem integration. Using the scripting environment, you can easily do something like

cactvs>ens create 753

ens0

cactvs>ens get ens0 E_NAMESET

PROPANE-1,2,3-TRIOL GLYCEROL 8043-29-6 29796-42-7 30049-52-6 37228-54-9 75398-78-6 78630-16-7 8013-25-0 175385-78-1 25618-55-7 64333-26-2 56-81-5 {Tegin M} LS-1377 G8773_SIGMA 15523_RIEDEL {Glycerin, natural} NCGC00090950-03 191612_ALDRICH 15524_RIEDEL {Glycerol solution} L-glycerol 49767_FLUKA {Biodiesel impurity} 49770_FLUKA 49771_FLUKA NCGC00090950-01 49927_FLUKA Glycerol-Gelatine G7757_SIAL GOL D-glycerol G9012_SIAL {Polyhydric alcohols} c0066 MOON {NSC 9230} G2025_SIGMA ZINC00895048 49781_FLUKA {Concentrated glycerin} {Concentrated glycerin (JP15)} D00028 {Glycerin (JP15/USP)} 44892U_SUPELCO {Glycerin, concentrated (JAN)} CRY 49782_FLUKA NCGC00090950-02 G6279_SIAL W252506_ALDRICH G7893_SIAL {Glycerin, concentrated} 33224_RIEDEL Bulbold Cristal Glyceol G9281_SIGMA Glycerol-1,2,3-3H G1901_SIGMA G7043_SIGMA 1,2,3-trihydroxypropane 1,2,3-trihydroxypropanol glycerin G2289_SIAL G9406_SIGMA {Glycerol-[2-3H]} CHEBI:17754 Glyzerin Oelsuess InChI=1/C3H8O3/c4-1-3(6)2-5/h3-6H,1-2H {90 Technical glycerine} Dagralax {Glycerin, anhydrous} {Glycerin, synthetic} Glycerine Glyceritol {Glycyl alcohol} Glyrol Glysanin NSC9230 Ophthalgan Osmoglyn Propanetriol {Synthetic glycerin} {Synthetic glycerine} Trihydroxypropane Vitrosupos {WLN: Q1YQ1Q} Glycerol-1,3-14C {4-01-00-02751 (Beilstein Handbook Reference)} AI3-00091 {BRN 0635685} {CCRIS 2295} {Caswell No. 469} {Citifluor AF 2} {Clyzerin, wasserfrei [German]} {EINECS 200-289-5} {EPA Pesticide Chemical Code 063507} {FEMA No. 2525} {Glicerina [DCIT]} {Glicerol [INN-Spanish]} {Glycerin (mist)} {Glycerin [JAN]} {Glycerin mist} {Glycerine mist} Glycerinum {Glycerolum [INN-Latin]} Grocolene {HSDB 492} IFP {Incorporation factor} 1,2,3-Propanetriol C00116 Optim {Propanetriol (VAN)} {1,2,3-PROPANETRIOL, HOMOPOLYMER} {Glycerol polymer} {Glycerol, polymers} {HL 80} {PGL 300} {PGL 500} {PGL 700} Polyglycerin Polyglycerine Polyglycerol {Unigly G 2} {Unigly G 6} G5516_SIGMA MolMap_000024

cactvs>

This hides all PUG ugliness - but in any case, I dare say that PUG is well documented. The toolkit goes much beyond simple data downloads - you can even open and query PubChem like a local SD file if you want to.

PubChem does not contain safety data, though. And safety data is country/region-dependent, strictly regulated, and you should be really careful not to be hit with liabilities. Have your approach checked by legal personnel!

  • Thank you for your reply. I downloaded cactvs pdf documentation and I will go over it. It looks very interesting. – Aleadam May 10 '11 at 21:40