Webscraping: Table not included in BeautifulSoup Page

Question

I am trying to scrape a table of company info from the table on this page: https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/

I can see the table contents when using Chrome's dev tools element inspector, but when I request the page in my script, the contents of the table are gone: the table is there, just with no content.

Any idea how I can get that sweet, sweet content?

Thanks

Code is below:

import requests
from bs4 import BeautifulSoup
response = requests.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
page = BeautifulSoup(response.text, "html.parser")
page

It looks like the table content is loaded after the page, which means JavaScript is responsible for populating it. Because of that, you will have to use something like Selenium to load the page first, and then BeautifulSoup to scrape it. Here's a related SO question; check out my answer: https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python/50593885#50593885 — Biarys, Mar 07 '19 at 23:03
@Biarys, thanks for the tip. I tried using this condition: until(EC.presence_of_all_elements_located((By.ID, "companyList"))) but the table is still coming back blank — Cory, Mar 08 '19 at 02:20
np. As far as I can remember, it should open up a browser and go directly to the link. Try watching it and see whether the table loads there. Perhaps you are waiting for the wrong element. As a workaround, you could try time.sleep(10) before scraping — Biarys, Mar 08 '19 at 03:05
@Cory, The `tbody` element with id "companyList" is loaded with the page without any data in it. So you have to wait for rows in the table to appear. Try `EC.presence_of_element_located((By.XPATH, '//tbody[@id="companyList"]/tr'))` — Kamal, Mar 08 '19 at 06:12

score 0 · Answer 1 · answered Mar 07 '19 at 23:13

Based on the network traffic shown in the dev tools, the content isn't in the HTML itself but is fetched dynamically by the ApiService.js script. My suggestion would be to use Selenium to extract the content once the page has fully loaded (for example, once the loading element has disappeared).
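
A minimal sketch of that approach, assuming Selenium with a Chrome driver is installed and using the row-level XPath suggested in the comments (the empty `tbody` with id "companyList" is already present on load, so the wait targets its rows instead). The function name `scrape_rows` and the 30-second timeout are illustrative choices, not anything from the page itself.

```python
URL = "https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/"

# Wait for actual rows: the tbody itself exists before any data arrives.
ROW_XPATH = '//tbody[@id="companyList"]/tr'


def scrape_rows(url=URL, timeout=30):
    """Open the page in Chrome, wait for table rows, return cell text per row."""
    # Imported here so the module can be loaded on machines without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Blocks until at least one <tr> exists inside the tbody, or times out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, ROW_XPATH))
        )
        rows = driver.find_elements(By.XPATH, ROW_XPATH)
        return [
            [cell.text for cell in row.find_elements(By.TAG_NAME, "td")]
            for row in rows
        ]
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


# Example call (launches a real browser):
# rows = scrape_rows()
```

If you would rather parse with BeautifulSoup, you could return `driver.page_source` after the wait and pass that to `BeautifulSoup(..., "html.parser")` instead.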

score 0 · Accepted Answer · answered Mar 08 '19 at 11:18

You can find the API in the network traffic tab: it's calling

https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/@@api-disclosure?isabstract=0&companyName=&ticker=&year=2018&analysis=1&index=&sic=&keywords=

and you should be able to reconstruct the table from the resulting JSON. I haven't played around with all the parameters, but it seems like only year affects the resulting data set, i.e.

https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/@@api-disclosure?isabstract=0&year=2018&analysis=1

should give you the same result as the query above.
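
As a rough sketch, the shortened query can be built and fetched with `requests` as below. The helper names `build_query` and `fetch_disclosures` are made up for illustration, and the shape of the returned JSON is not described here, so the code just returns whatever the endpoint sends back without assuming any keys.

```python
from urllib.parse import urlencode

API_URL = (
    "https://tools.ceres.org/resources/tools/"
    "sec-sustainability-disclosure/@@api-disclosure"
)


def build_query(year, **extra):
    """Build the API URL for one year; extra keyword args become query params."""
    params = {"isabstract": 0, "year": year, "analysis": 1}
    params.update(extra)
    return f"{API_URL}?{urlencode(params)}"


def fetch_disclosures(year, timeout=30):
    """Request the JSON for one year and return it as parsed Python objects."""
    # Imported here so build_query can be used without requests installed.
    import requests

    response = requests.get(build_query(year), timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()       # structure not assumed; inspect before use


# Example call (performs a network request):
# data = fetch_disclosures(2018)
```

Printing `type(data)` and a small slice of it first is a sensible way to find out which fields correspond to the table columns.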

Thanks Gregor! You're right, that's all the data I need, and I've learned something new about network traffic. Thanks for your help! — Cory, Mar 08 '19 at 16:14
