Python - Getting HTML with DOM

Question

I have a flash card making program for Spanish that pulls information from here: http://www.spanishdict.com/examples/zorro (this is just an example). I've set it up so it gets the translations fine, but now I want to add examples. I noticed, however, that the examples on that page are dynamically generated, so I installed Beautiful Soup and the html5lib parser. The tag I'm specifically interested in is:

<span class="megaexamples-pair-part">Los perros siguieron el rastro del <span class="megaexamples-highlight">zorro</span>. </span>

The code I'm using to try and retrieve it is:

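from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("http://www.spanishdict.com/examples/zorro").read(), 'html5lib')
example = soup.findAll("span", {"class": "megaexamples-pair-part"})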

However, no matter how I approach it, I can't get it to pull down the dynamically generated code. I have confirmed that I get the page by searching for megaexamples-container, which works fine (you can also see it by right-clicking in Google Chrome and choosing View Page Source).

Any ideas?

The content may be generated after load by javascript. [Check this answer.](https://stackoverflow.com/questions/13960567/reading-dynamically-generated-web-pages-using-python) — Simon T, Jun 16 '17 at 15:13

score 1 · Answer 1 · answered Jun 16 '17 at 15:11

What you're doing just pulls the initial HTML page; the page is likely loading more data from the server afterwards via a JavaScript call, so that data never appears in what you downloaded.
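To see the difference offline, here is a minimal sketch. Both HTML strings are made-up stand-ins (I'm assuming the container arrives empty and is filled in later by a script), and I'm using the built-in "html.parser" instead of html5lib only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for what urlopen() returns: the container exists,
# but the example spans have not been inserted yet.
static_html = '<div class="megaexamples-container"></div>'

# Assumed stand-in for the DOM after the page's scripts have run.
rendered_html = (
    '<div class="megaexamples-container">'
    '<span class="megaexamples-pair-part">Los perros siguieron el rastro del '
    '<span class="megaexamples-highlight">zorro</span>. </span>'
    '</div>'
)


def count_examples(html):
    """Count the spans that carry the megaexamples-pair-part class."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("span", class_="megaexamples-pair-part"))


print(count_examples(static_html))    # container only, no example spans
print(count_examples(rendered_html))  # one example span present
```

If the count is zero on the downloaded source but the spans show up in the browser's element inspector, the content is being added after load.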

You have 2 options:

1. Use a webdriver such as selenium to control a web browser that loads the entire page, scripts included (you can then parse the result with BeautifulSoup or find elements with selenium's own tools). This incurs some overhead because a real browser has to run.
2. Use the network tab of your browser's developer tools (usually opened with F12) to see which requests the page makes while loading its dynamic content, then replicate those requests with the requests module. This is more efficient but can also be trickier.

Remember to do this only if you have permission from the site's owner, though. In many cases it's against the ToS.

score 0 · Accepted Answer · answered Jun 16 '17 at 23:27

I used Pedro's answer to get me moving in the right direction. Here is what I did to get it to work:

1. Install selenium with pip install selenium
2. Download the driver for the browser you want to automate. You can download them from this page. The driver must be on your PATH, or you will need to specify its location in the constructor for the webdriver.
3. Import selenium with from selenium import webdriver
4. Now use the following code:

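from selenium import webdriver

browser = webdriver.Chrome()  # starts a Chrome instance controlled through the driver
browser.get(input("Enter URL: "))  # Python 3; use raw_input() on Python 2
html_source = browser.page_source  # HTML of the page as the browser currently sees it
browser.quit()  # close the browser once the source has been captured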

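Note: If the driver is not on your PATH, you have to call the constructor with browser = webdriver.Chrome(<PATH_TO_DRIVER_HERE>). In Selenium 4 and later the path is passed through a Service object instead: webdriver.Chrome(service=Service(<PATH_TO_DRIVER_HERE>)).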

Note 2: You can use something like webdriver.Firefox() if you want a different browser.

Now you can parse it with something like: soup = BeautifulSoup(html_source, 'html5lib')
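Putting the pieces together, a rough sketch could look like the following. The helper names fetch_rendered_html and extract_examples are my own, the class name comes from the question and may no longer match the live site, and the explicit wait is an addition of mine in case the examples arrive a moment after the page loads. I also use the built-in "html.parser" here so the parsing half runs without html5lib:

```python
from bs4 import BeautifulSoup

# Class name taken from the question; treat it as an assumption about the site.
EXAMPLE_CLASS = "megaexamples-pair-part"


def fetch_rendered_html(url, timeout=10):
    """Load url in Chrome and return the page source once an example span exists."""
    # Imported here so extract_examples() below works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Wait until at least one example span has been added to the DOM.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "span." + EXAMPLE_CLASS))
        )
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even if the wait times out


def extract_examples(html_source):
    """Return the text of every example span with whitespace normalised."""
    soup = BeautifulSoup(html_source, "html.parser")
    spans = soup.find_all("span", class_=EXAMPLE_CLASS)
    return [" ".join(span.get_text().split()) for span in spans]
```

You would then call something like extract_examples(fetch_rendered_html("http://www.spanishdict.com/examples/zorro")). The parsing helper can be tried on any saved HTML string without opening a browser.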
