
I've been wanting to parse information from a particular website, and I have been having problems with its dynamic aspect. When I request the page in Python and parse it with BeautifulSoup, etc., everything inside <div id="root"> is missing.
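For reference, this is roughly what I mean (a minimal sketch; the exact parsing call is just illustrative):

import requests
from bs4 import BeautifulSoup

url = "http://gnomad.broadinstitute.org/region/16-2087388-2087428?dataset=gnomad_r2_1"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
# The root div comes back empty: its contents are rendered client-side by JavaScript
print(soup.find("div", id="root"))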

According to the answer to this similar question -- Why isn't the html code inside div is being parsed? -- I should use a headless browser. I ended up trying Selenium and splinter with the '--headless' option enabled for Chrome.

I don't know whether the headless browser I chose is just the wrong one for this particular website's setup, or if it's my code, so please give me suggestions if you have any.

Notes: Running Ubuntu 20.04.1 LTS and Python 3.8.3. If you want to suggest different headless browser programs, go ahead, but they need to be compatible with Linux, macOS, etc. and Python.

Below is my most recent code. I've tried various ways to ".find" the button I want to click. Here I used the XPath of the element I want, which I got through the browser's Inspect tool:

from splinter import Browser
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')

with Browser('chrome', options=options) as browser:
    browser.visit("http://gnomad.broadinstitute.org/region/16-2087388-2087428?dataset=gnomad_r2_1")
    print(browser.title)
    browser.find_by_xpath('//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button').first.click() 

The error message this gave me was:

File "etc/anaconda3/lib/python3.8/site-packages/splinter/element_list.py", line 42, in __getitem__ 
    return self._container[index]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "practice3.py", line 20, in
browser.find_by_xpath('//[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button').first.click()
File "etc/anaconda3/lib/python3.8/site-packages/splinter/element_list.py", line 57, in first
return self[0]
File "etc/anaconda3/lib/python3.8/site-packages/splinter/element_list.py", line 44, in getitem
raise ElementDoesNotExist(
splinter.exceptions.ElementDoesNotExist: no elements could be found with xpath "//

[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button"

Thanks!


1 Answer


Your problem seems to be that you don't wait for the elements to fully load. I set up an environment with your piece of code, printed the page source, and ran the response through an HTML beautifier (https://www.freeformatter.com/html-formatter.html#ad-output).

There I found that the div you want to access is still in the state

<div class="StatusMessage-xgxrme-0 daewTb">Loading region...</div>

That message means the site has not fully loaded yet. To fix this, you can simply wait for the element to load, which Selenium can do:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button'))
)

This will wait for the element to be loaded and clickable.
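Note that until() also returns whatever the condition resolves to -- here, the element itself -- so you can capture it and click it directly (same XPath as above):

button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button'))
)
button.click()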

Here's the code snippet I tested with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')

with webdriver.Chrome("<path-to-driver>", options=options) as browser:
    browser.get("http://gnomad.broadinstitute.org/region/16-2087388-2087428?dataset=gnomad_r2_1")
    # Wait until the button has actually rendered and is clickable
    WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button'))
    )
    print(browser.title)
    print(browser.page_source)
    b = browser.find_element_by_xpath('//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button')
    # Click via JavaScript; a plain b.click() was intercepted (see below)
    browser.execute_script("arguments[0].click()", b)

Simply replace the <path-to-driver> with the path to your chrome webdriver.

The last bit is because a plain click on the button raised selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted, which Element is not clickable with Selenium and Python solved.
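If the JavaScript click feels too blunt, a common alternative for intercepted clicks (my suggestion, not taken from that link) is to scroll the element into view first and then click it normally:

# Assumed alternative: bring the button into the viewport so the native click isn't intercepted
browser.execute_script("arguments[0].scrollIntoView(true);", b)
b.click()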

  • Thanks a ton for the help! But for some reason the wait always raises a TimeoutException, even when I increase the number of seconds it's supposed to wait to like 1000. Do you have any idea why that might be happening? – etcTryAgain Aug 05 '20 at 23:03
  • Hm, you could catch the exception to see if anything has loaded by printing the source. – JPDF Aug 06 '20 at 05:32
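A minimal sketch of that debugging suggestion (same wait and XPath as above; the timeout value is arbitrary):

from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button'))
    )
except TimeoutException:
    # Dump whatever did render, to see how far the page got before timing out
    print(browser.page_source)
    raise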