How to parse an html webpage that executes javascript

Question

I am trying to write a program that scrapes the IUPAC condensed representation from this webpage.

Here G03307GF is the ID. I need this:

HexNAc(b1-?)[Fuc(a1-?)]GlcNAc(b1-2)Man(a1-3)[HexNAc(b1-?)[Fuc(a1-?)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc

I tried to use selenium for this.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome('', options = chrome_options)

# takes accession number and returns IUPAC
def getIUPAC(acc_no):

    url = 'https://glytoucan.org/Structures/Glycans/' + acc_no

    driver.get(url)
    IUPAC = driver.find_element_by_xpath('//*[@id="descriptors"]/togostanza-iupaccondensed//main/div/pre/code/text()')
    driver.close()

    return IUPAC

IUPAC = getIUPAC('G37498VS')

print(IUPAC)

It says that the element does not exist.

Possible duplicate of [Can Xpath expressions access shadow-root elements?](https://stackoverflow.com/questions/49763626/can-xpath-expressions-access-shadow-root-elements) — Smart Manoj, May 25 '19 at 03:55
https://stackoverflow.com/questions/28911799/accessing-elements-in-the-shadow-dom — Smart Manoj, May 25 '19 at 04:02
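
Both linked questions concern the shadow DOM: an XPath expression evaluated against the document cannot cross a shadow root boundary, and `find_element` also cannot return a `text()` node. One possible workaround is to let the browser do the lookup in JavaScript. The following is a minimal sketch, not a verified solution for this site. It assumes that the `togostanza-iupaccondensed` element hosts an open shadow root and that the selectors below, which are guessed from the XPath in the question, match the page. The helper name `get_shadow_text` is hypothetical.

```python
# JavaScript run inside the browser: find the shadow host, then search
# inside its shadow root. Returns null if anything is missing.
SHADOW_TEXT_SCRIPT = """
const host = document.querySelector(arguments[0]);
if (!host || !host.shadowRoot) { return null; }
const node = host.shadowRoot.querySelector(arguments[1]);
return node ? node.textContent : null;
"""


def get_shadow_text(driver, host_css, inner_css):
    """Return the text of inner_css inside the shadow root of host_css.

    driver is any object exposing Selenium's execute_script(script, *args).
    The result is None when the host, an open shadow root, or the inner
    element cannot be found.
    """
    return driver.execute_script(SHADOW_TEXT_SCRIPT, host_css, inner_css)


# Possible usage with the driver from the question (selectors are assumptions):
# driver.get('https://glytoucan.org/Structures/Glycans/G03307GF')
# print(get_shadow_text(driver,
#                       '#descriptors togostanza-iupaccondensed',
#                       'main div pre code'))
# driver.quit()
```

Because the content is filled in by JavaScript, the lookup may return None until the page has finished rendering, so a wait or retry before calling the helper is probably needed.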

score 2 · Accepted Answer · answered May 25 '19 at 03:55

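import re
import requests

def getIUPAC(acc_no):
    ret = requests.get('https://glytoucan.org/Structures/Glycans/{}'.format(acc_no))
    # Use ret.text (str) rather than ret.content (bytes): in Python 3 a str
    # pattern cannot be matched against a bytes object.
    z = re.search(r'<meta name="description".*?The IUPAC representation is (.+)\.\s+The', ret.text, re.DOTALL | re.MULTILINE)
    # Return the captured string, not the match object itself.
    return z.group(1) if z else 'Unknown'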


print('IUPAC is {}'.format(getIUPAC('G03307GF')))

Our result is...

IUPAC is HexNAc(b1-?)[Fuc(a1-?)]GlcNAc(b1-2)Man(a1-3)[HexNAc(b1-?)[Fuc(a1-?)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc

That worked! Thanks. I did not use 're' because there was some issue with bytes but the 'requests' thing worked. — Shaurya, May 27 '19 at 21:21
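
The bytes issue mentioned in the comment above most likely comes from Python 3 refusing to apply a str pattern to a bytes object such as `response.content`. A minimal sketch of one way around it is shown here: decode first, then search. The sample markup is hypothetical and only imitates the meta description the answer relies on, the UTF-8 encoding is an assumption, and `extract_iupac` is an illustrative name.

```python
import re

# Hypothetical sample; the real page's markup may differ.
SAMPLE = (b'<meta name="description" content="The IUPAC representation is '
          b'Man(b1-4)GlcNAc. The remainder is omitted here.">')

# Non-greedy capture up to the first ". The" after the marker phrase.
PATTERN = re.compile(r'The IUPAC representation is (.+?)\.\s+The', re.DOTALL)


def extract_iupac(page):
    """Return the IUPAC string found in page (bytes or str), or None."""
    # Decode bytes so that the str pattern can be applied.
    if isinstance(page, bytes):
        page = page.decode('utf-8', errors='replace')
    match = PATTERN.search(page)
    return match.group(1) if match else None


print(extract_iupac(SAMPLE))
```

With requests, passing `response.text` instead of `response.content` achieves the same thing without an explicit decode.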

score 0 · Answer 2 · answered May 25 '19 at 19:45

It is better to use requests, as shown by VeNoMouS. I just wanted to add that you are getting "element does not exist" because the driver was closed before you printed the result.

Alan
