I am trying to scrape data from a table presented as HTML using the BeautifulSoup and requests libraries, but I could not get all of the HTML code.

My code is as below. For brevity's sake I have not included the output.

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

url = 'https://www2.susep.gov.br/safe/Corretores/pesquisa.html'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}

# Build the request with a browser-like User-Agent
req = Request(url, headers=headers)

# Open the URL
response = urlopen(req)

# Read and print the raw HTML
print(response.read())

But the code failed to read a <main class> division from the HTML. There is a table on the page that is missing from the HTML I read.

  • Is it being created dynamically by JS? – ggorlen Oct 15 '20 at 22:54
  • I couldn't say, I'm pretty new at web scraping. How could you check that? – joaovitorpigozzo Oct 15 '20 at 23:05
  • Look at the page and see if there's a script tag that's injecting something into the DOM. Or look at the page and see that the elements aren't visible without JS, as you've done here. You can run the site with JS disabled and see if the elements show up. If the data is coming in through AJAX you could look at the requests tab in your dev tools, figure out where the endpoint that the data is coming from is, and try hitting it yourself to bypass scraping HTML. – ggorlen Oct 15 '20 at 23:09
  • Duplicate of [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – esqew Oct 15 '20 at 23:13

2 Answers

2

The data is being dynamically generated via JS. If you disable JavaScript in your browser's dev tools, you will see that the webpage is basically empty.
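You can also confirm this from Python without touching the browser. The following is only a rough sketch: `TagCounter`, `count_tags` and the `static_html` string are names and data made up for illustration (the string stands in for what `response.read()` returns, it is not the real page). It counts how many times a tag appears in the raw HTML using the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of <tag> start tags found in the html string."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up stand-in for a JS-rendered page: an empty <main> plus a script tag.
static_html = (
    "<html><body><main class='container'></main>"
    "<script src='app.js'></script></body></html>"
)

# Zero <table> tags in the raw HTML suggests the table is built later by JS.
print(count_tags(static_html, "table"))
```

With the code from the question you would pass in the decoded response instead, for example `count_tags(response.read().decode("utf-8", errors="replace"), "table")`, assuming the page is UTF-8. If that prints 0 while the browser shows a table, the table is being added after the page loads.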

You will need to either find out where the data comes from (some web API), using a tool like HTTP Trace, or use something like Selenium, which runs the JavaScript so that the HTML gets loaded.
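For the first option, once you have found the request that returns the data, calling it directly could look roughly like the sketch below. Everything specific here is an assumption: I have not identified the real endpoint for this site, so `build_request`, `decode_rows`, `fetch_rows` and the sample payload are hypothetical, and the sketch assumes the endpoint returns UTF-8 JSON.

```python
import json
from urllib.request import Request, urlopen

# Browser-like header, as in the question; some servers reject the default one.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_request(url):
    """Create a Request for the given URL with the browser-like headers."""
    return Request(url, headers=HEADERS)


def decode_rows(raw):
    """Decode a JSON response body (bytes) into Python objects."""
    return json.loads(raw.decode("utf-8"))


def fetch_rows(url, timeout=30):
    """Fetch the URL and decode its JSON body (performs a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return decode_rows(response.read())


# Offline demonstration with a made-up payload, so no network is needed.
sample = b'[{"name": "example", "id": 1}]'
rows = decode_rows(sample)
print(rows[0]["name"])
```

In practice you would call `fetch_rows(...)` with whatever URL shows up in the network tab of your dev tools. If the endpoint needs extra headers, cookies or a POST body, those would have to be copied from the browser's request as well.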

Cho'Gath
-1

It looks like angular.js is being used to compile the HTML. Maybe try scraping from the inspection tool in your browser?

dang