I'd like to systematically scrape the privacy breach data found here, which is embedded directly in the HTML of the page. I've found various threads on StackOverflow about missing HTML and about not being able to scrape a table using BS4. Both of these threads seem very similar to the issue I'm having, but I'm having a difficult time reconciling the differences.

Here's my problem: when I pull the HTML using either Requests or urllib (Python 3.6), the second table does not appear in the soup. The second link above explains that this can happen if the table or its data is added after the page loads using JavaScript. However, when I examine the page source the data is all there, so that doesn't seem to be the issue. A snippet of my code is below.

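import requests
from bs4 import BeautifulSoup

url = 'https://www.privacyrights.org/data-breach/new?title=&page=1'
# verify=False skips TLS certificate verification
r = requests.get(url, verify=False)
# parse the raw response bytes with the html5lib parser
soupy = BeautifulSoup(r.content, 'html5lib')
print(len(soupy.find_all('table')))
# only finds 1 table, there should be 2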

This code snippet fails to find the table with the actual data in it. I've tried the lxml, html5lib, and html.parser parsers. I've tried both the urllib and Requests packages to pull down the page.

Why can't requests + BS4 find the table that I'm looking for?

Alexander
  • I only see one table. What second table were you expecting? Are you certain the page doesn't contain JavaScript code that alters the DOM in the browser to add tables? – Martijn Pieters Jul 28 '16 at 13:36
  • It is possible, but I'm not all that familiar with JavaScript. When I view the page source the data is in the table with class="data-breach-table". – Alexander Jul 28 '16 at 13:40
  • That table is not part of the source served by the request. This is not a BeautifulSoup problem. You need to either use Selenium to drive a browser and execute the JavaScript code that loads that table, or reverse-engineer the page code and figure out how that table is constructed. – Martijn Pieters Jul 28 '16 at 13:42
  • Excellent, I will look into that. Thank you! – Alexander Jul 28 '16 at 13:43

1 Answer

Looking at the HTML delivered from the URL, it appears that there only IS one table in it, which is precisely why Beautiful Soup can't find two!
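
One way to check this for yourself is to list the tables in the markup the server actually sends, independently of what the browser shows after scripts have run. The following is only a minimal sketch: it uses the standard library's `html.parser` instead of BS4 so that it has no dependencies, and the `served` string is a made-up stand-in for the real response (the `layout` class name and the script body are assumptions for illustration, not copied from the site). In practice you would pass `r.text` from the question's code instead.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Record the class attribute of every <table> start tag seen."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs; class may be absent.
            self.table_classes.append(dict(attrs).get("class"))


def tables_in(html):
    """Return the class attribute (or None) of each <table> in html."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.table_classes


# Hypothetical stand-in for the served page: one static table, plus a
# script that would add a second table only when executed in a browser.
served = """
<html><body>
<table class="layout"><tr><td>static</td></tr></table>
<script>
  var t = document.createElement("table");
  t.className = "data-breach-table";
  document.body.appendChild(t);
</script>
</body></html>
"""

print(tables_in(served))
```

If `data-breach-table` is missing from the result for the real response, the table is being built client-side, and no choice of parser will make it appear in the soup.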

holdenweb