How to extract 'Odor' information from PubChem using BeautifulSoup

Question

I wrote the following Python code to extract 'odor' information from PubChem for a particular molecule, in this case nonanal (CID=31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor

import requests
from bs4 import BeautifulSoup

url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})

print(odor_info.text.strip())

I get the following error: AttributeError: 'NoneType' object has no attribute 'find'. It seems that BeautifulSoup does not extract the whole page.

I expect the following output: Orange-rose odor, Floral, waxy, green

The page you're trying to scrape is generated with JavaScript; you either need to use an interpreter (e.g. Selenium) or to access the data using NCBI's API — mozway, Feb 18 '23 at 13:31
Thanks, I was not aware of that. I will try to use Selenium right now... — John Mommers, Feb 18 '23 at 13:36
Standard debugging step: view the page in a browser, with JavaScript disabled. Also look at the page source. — Karl Knechtel, Feb 18 '23 at 13:40

Yarin_007 · Accepted Answer · 2023-02-18T14:25:57.307

The page in question makes an AJAX request to load its data. We can see this in a web browser by looking at the Network tab of the dev tools (F12 in many browsers).

That is to say, the data simply isn't there when the initial page loads - so it isn't found by BeautifulSoup.
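
As a minimal sketch of why the lookup fails, the snippet below parses a made-up stand-in for the HTML that arrives before any JavaScript runs. The markup in INITIAL_HTML is an assumption for illustration, not PubChem's actual page source; the point is only that find() returns None when the section is absent, and that checking for None avoids the AttributeError from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML the server sends before JavaScript runs.
# The real PubChem markup differs; only the missing Odor section matters here.
INITIAL_HTML = """
<html>
  <body>
    <div id="js-rendered-content"></div>
    <script src="app.js"></script>
  </body>
</html>
"""

soup = BeautifulSoup(INITIAL_HTML, "html.parser")
odor_section = soup.find("section", {"id": "Odor"})

# find() returns None when nothing matches, so calling .find() on the result
# would raise AttributeError. Checking for None first avoids that.
if odor_section is None:
    message = "No Odor section in the static HTML; it is loaded later by JavaScript."
else:
    message = odor_section.get_text(strip=True)

print(message)
```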

To solve the problem:

use Selenium, which can actually run the JavaScript code and thus populate the page with the desired data; or
simply query the API according to the request seen when loading the page in the browser. Thus:

import requests

# PubChem compound ID (CID) for nonanal
PubChem_Nonanal_CID = 31289
compound_data_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'
compound_info = requests.get(compound_data_url.format(PubChem_Nonanal_CID))

print(compound_info.json())

Parsing the JSON Reply

Parsing it proves a bit of a challenge, as it consists of many nested lists. If the order of properties isn't guaranteed, you could look the sections up by heading, like this:

# Walk the nested sections by heading rather than by position
for section in compound_info.json()['Record']['Section']:
    if section['TOCHeading'] == 'Chemical and Physical Properties':
        for sub_section in section['Section']:
            if sub_section['TOCHeading'] == 'Experimental Properties':
                for sub_sub_section in sub_section['Section']:
                    if sub_sub_section['TOCHeading'] == 'Odor':
                        print(sub_sub_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
                        break
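
The nested loops above could also be written as a small recursive search, which avoids hard-coding the parent headings. This is only a sketch: find_section is a hypothetical helper, and sample_record is a heavily trimmed, made-up stand-in for the real reply, so that the example runs without a network request.

```python
def find_section(sections, heading):
    """Depth-first search for the first section whose TOCHeading equals heading."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
        # Sections without children simply yield an empty list here.
        found = find_section(section.get("Section", []), heading)
        if found is not None:
            return found
    return None


# Hypothetical, trimmed record that mimics the nesting of the PUG View reply.
sample_record = {
    "Record": {
        "Section": [
            {"TOCHeading": "Names and Identifiers", "Section": []},
            {
                "TOCHeading": "Chemical and Physical Properties",
                "Section": [
                    {
                        "TOCHeading": "Experimental Properties",
                        "Section": [
                            {
                                "TOCHeading": "Odor",
                                "Information": [
                                    {"Value": {"StringWithMarkup": [{"String": "Orange-rose odor"}]}}
                                ],
                            }
                        ],
                    }
                ],
            },
        ]
    }
}

odor_section = find_section(sample_record["Record"]["Section"], "Odor")
first_odor = odor_section["Information"][0]["Value"]["StringWithMarkup"][0]["String"]
print(first_odor)
```

With the real data, the first argument would be compound_info.json()['Record']['Section']. The helper returns None when no section has the requested heading, so the result should be checked before indexing into it.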

Otherwise, follow the path shown by a JSON viewer such as jsonformatter.com and index into the reply directly:

# object►Record►Section►3►Section►1►Section►2►Information►0►Value►StringWithMarkup►0►String

odor = compound_info.json()['Record']['Section'][3]['Section'][1]['Section'][2]['Information'][0]['Value']['StringWithMarkup'][0]['String']
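
Both approaches above read only the first Information entry, while the output expected in the question lists more than one description. A sketch of collecting every string under a section follows. collect_strings is a hypothetical helper, and the sample section is made up: it assumes the two descriptions quoted in the question arrive as separate Information entries, which may not match the live reply.

```python
def collect_strings(section):
    """Return every 'String' found under a section's Information entries."""
    strings = []
    for info in section.get("Information", []):
        # Entries without a StringWithMarkup list are skipped.
        for item in info.get("Value", {}).get("StringWithMarkup", []):
            text = item.get("String")
            if text:
                strings.append(text)
    return strings


# Hypothetical Odor section shaped like the PUG View reply (assumed layout).
odor_section = {
    "TOCHeading": "Odor",
    "Information": [
        {"Value": {"StringWithMarkup": [{"String": "Orange-rose odor"}]}},
        {"Value": {"StringWithMarkup": [{"String": "Floral, waxy, green"}]}},
        {"Value": {"Number": [1.0]}},  # no StringWithMarkup, so it is ignored
    ],
}

odors = collect_strings(odor_section)
print(", ".join(odors))
```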
