In the past, when I've used BeautifulSoup and lxml to parse webpages, it's been pretty easy because links all looked like this: <a href="www.website.com">Website</a>. However, I've encountered some webpages where links appear in the browser but not in the page source.
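
For static HTML like that, a find_all is all it takes (a quick sketch with bs4):

from bs4 import BeautifulSoup

html = '<a href="www.website.com">Website</a>'
soup = BeautifulSoup(html, 'html.parser')
print([a['href'] for a in soup.find_all('a', href=True)])
# ['www.website.com']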

For example, on this Edmunds.com page, the Past Long-Term Road Tests section looks like this:

1991 Acura NSX
2011 Acura TSX Sport Wagon
...


However, the source code for the Past Long-Term Road Tests section of the page looks like this:

<script type="text/javascript">
PAGESETUP.addControl(function() {
    function linksObj(){
        var elink = "|acura|nsx|1991|long-term-road-test|"; //generates edmunds.com/acura/nsx/1991/long-term-road-test/
        this.link0 = {anchor:elink,label:"1991 Acura NSX"};
        var elink = "|acura|tsx-sport-wagon|2011|long-term-road-test|"; //generates edmunds.com/acura/tsx-sport-wagon/2011/long-term-road-test/
        this.link1 = {anchor:elink,label:"2011 Acura TSX Sport Wagon"};
        ...
    }
    var links_obj = new linksObj();
    var links_container = document.getElementById('links_list_offpage2');
    var more_link = "";
    var more_link_text = "";
    var elinks = new EDMUNDS.linksList(links_obj, links_container, more_link, more_link_text);
}, 'low');
</script>

In the browser, the JavaScript line var elink = "|acura|nsx|1991|long-term-road-test|"; gets expanded into the link edmunds.com/acura/nsx/1991/long-term-road-test/.
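
In Python terms the expansion just swaps the pipes for slashes and prefixes the domain, so the anchor/label pairs can even be pulled out of the raw script text with a regex (a rough sketch, assuming the pipe-to-slash mapping that the inline comments suggest; the script text is hard-coded here for illustration):

import re

# the <script> body, e.g. as extracted with BeautifulSoup or lxml
script_text = '''
var elink = "|acura|nsx|1991|long-term-road-test|";
this.link0 = {anchor:elink,label:"1991 Acura NSX"};
var elink = "|acura|tsx-sport-wagon|2011|long-term-road-test|";
this.link1 = {anchor:elink,label:"2011 Acura TSX Sport Wagon"};
'''

# pair each elink value with the label that follows it
for anchor, label in re.findall(
        r'var elink = "([^"]+)";.*?label:"([^"]+)"', script_text, re.DOTALL):
    # "|acura|nsx|1991|long-term-road-test|" -> "edmunds.com/acura/nsx/1991/long-term-road-test/"
    print(label, '->', 'edmunds.com' + anchor.replace('|', '/'))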


Tools like BeautifulSoup and lxml aren't finding the links that are generated by JavaScript. How can I parse these links?


1 Answer

Use a headless browser such as ghost.py to run the page's JavaScript, and you should have no problem scraping the JS-altered DOM.
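
For example (a minimal sketch against the classic ghost.py API, where Ghost().open() returns a (page, resources) pair and ghost.content exposes the rendered HTML; the URL is a placeholder):

from bs4 import BeautifulSoup
from ghost import Ghost

ghost = Ghost()
# open() loads the page and executes its JavaScript before returning
page, resources = ghost.open('http://www.edmunds.com/')  # placeholder URL

# ghost.content is the post-JavaScript DOM, so the generated
# <a> tags exist and BeautifulSoup can see them
soup = BeautifulSoup(ghost.content, 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])

Once the DOM has been rendered, the rest of the scraping code is the same as for a static page.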
