Parse HTML from DOM (Not static HTML)

Question

I'm trying to parse HTML data from the DOM. When I use Chrome's Developer Tools I can see the data in the Elements panel, but when I save the page as HTML locally and search for the targeted data, it can't be found. I've done some reading about how the static HTML file is what the browser receives, and how JavaScript then modifies the DOM before the page is presented.

Specific example: Google "nba". The results include a table at the top of the page with all the scheduled games for the day, nested inside a <tbody>. If you save this page, the HTML file does not contain a <tbody> tag. I'm trying to parse this table of games using BeautifulSoup4 with Python.
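
To make the static-versus-rendered difference concrete, here is a minimal sketch. It uses Python's standard-library `html.parser` rather than BeautifulSoup4 so that it has no dependencies, and both HTML strings are made-up stand-ins (not Google's real markup): one for what the server sends, one for what the DOM might look like after the script has run.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_in(html):
    """Return the list of start-tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Hypothetical static response: the table only exists as text inside a script.
STATIC_HTML = """<html><body>
<div id="games"></div>
<script>
  document.getElementById("games").innerHTML =
    "<table><tbody><tr><td>Team A vs Team B</td></tr></tbody></table>";
</script>
</body></html>"""

# Hypothetical serialization of the DOM after the script has run.
RENDERED_HTML = """<html><body>
<div id="games"><table><tbody><tr><td>Team A vs Team B</td></tr></tbody></table></div>
</body></html>"""

print("tbody" in tags_in(STATIC_HTML))    # False: script content is not parsed as markup
print("tbody" in tags_in(RENDERED_HTML))  # True
```

A parser never executes the script, so the table is invisible to it until something (a browser or a headless browser) has run the JavaScript and serialized the result.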

I don't think you can make Chrome save the current DOM state in a straightforward fashion. IIRC Firefox can do this, and you can use the web inspector to copy the DOM as HTML by right-clicking an element, then save that into a file. — millimoose, Dec 16 '12 at 01:38
Instead of having BeautifulSoup4 act on a saved file, you could also use a tool that drives a real browser (and thus supports JavaScript/AJAX) to do the screen scraping. WATIR and friends for Ruby, and PhantomJS, work this way. — millimoose, Dec 16 '12 at 01:39
You could also look for a proper API to get the data you want directly instead of screenscraping. — millimoose, Dec 16 '12 at 01:41
You can write a simple browser addon for doing this. That'll make it easy to parse the data as well, since you can use DOM methods for getting the content right off the page. — techfoobar, Dec 16 '12 at 03:06
@millimoose, thanks for the info, as Matt guessed I'm trying to do it programmatically. API data provider won't work as I'm just using the NBA as an example for learning about HTML parsing. — user1347648, Dec 16 '12 at 04:00
@user1347648 There's a difference between "parsing HTML" and "screenscraping". Parsing HTML is a subset of screenscraping. As you've learned, it's not a sufficient subset when it comes to websites that do a lot of work on the client. What I'm telling you is that BeautifulSoup simply can't accomplish what you're trying to do, and you need to use a different tool. — millimoose, Dec 16 '12 at 04:24

score 1 · Accepted Answer · edited May 23 '17 at 11:55

To do this completely programmatically, you need to run a headless browser – something that executes JavaScript just like your real browser. Ghost.py can make this easier.

Otherwise you can do as millimoose suggests, and save the current DOM state as HTML by using your browser's built-in developer tools.
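
Once the rendered DOM has been copied out of the developer tools and saved, extracting the rows is ordinary parsing. The sketch below is one way to do it, again using the standard-library `html.parser` in place of BeautifulSoup4 to stay dependency-free. The `SAVED_DOM` markup and the team names are hypothetical; the real table will have different structure and attributes.

```python
from html.parser import HTMLParser


class TbodyRows(HTMLParser):
    """Collect the text of each <td> for every <tr> inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_tbody = False
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self.rows.append([])          # start a new row
        elif tag == "td" and self._in_tbody and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "tbody":
            self._in_tbody = False
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text for the open cell


def parse_rows(html):
    """Return a list of rows, each a list of stripped cell strings."""
    parser = TbodyRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Hypothetical DOM fragment saved from the browser's inspector.
SAVED_DOM = """<table><tbody>
<tr><td>Team A</td><td>Team B</td><td>7:00 PM</td></tr>
<tr><td>Team C</td><td>Team D</td><td>9:30 PM</td></tr>
</tbody></table>"""

print(parse_rows(SAVED_DOM))
```

With BeautifulSoup4 the same extraction would be shorter; the point here is only that the saved DOM, unlike the saved page, actually contains the `<tbody>` to search.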

answered Dec 16 '12 at 01:37 by Matt Ball · edited by Community

I'm aiming for the completely programmatic route, so I set up ghost.py and ran the following test code: `from ghost import Ghost; ghost = Ghost(); page, resources = ghost.open('http://www.google.ca/#output=search&q=nba'); print(page.__dict__)` but I don't see any of the HTML content I need to parse. Is there something I'm missing? – user1347648 Dec 16 '12 at 03:53
Maybe, having some difficulty in using the httpresource object. I can see the dict contains the URL, headers, reply and http_status. I'm guessing I have to somehow extract the reply, and somehow convert it into HTML data? – user1347648 Dec 16 '12 at 04:04
Thanks, that helped. I now have a string that looks like the HTML before JS acts on it (like when I save it from a browser), but still not the content I'm trying to parse (like in Chrome's DOM inspector). I know I have to execute the JS by using ghost.evaluate, but I'm not sure what to use for the script parameter. – user1347648 Dec 16 '12 at 04:27
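
One plausible value for the script parameter is a JavaScript expression that serializes the live DOM, such as `document.documentElement.outerHTML`. The sketch below assumes the older Ghost.py interface used in the comment above (`Ghost()`, `ghost.open(url)`, and `ghost.evaluate(script)` returning a result together with the loaded resources); newer releases may differ, and the helper name `rendered_html` is made up for illustration. Content loaded by AJAX may also need some time to arrive before the DOM is dumped, which this sketch does not handle.

```python
# JavaScript expression that serializes the current DOM, including
# everything scripts have added since the page was loaded.
DUMP_DOM_JS = "document.documentElement.outerHTML"


def rendered_html(url):
    """Open `url` in Ghost.py and return the DOM serialized as HTML.

    Assumes the older Ghost.py API shown in the comment above; the import
    is done lazily so this module loads even without Ghost.py installed.
    """
    from ghost import Ghost  # third-party package, not in the standard library

    ghost = Ghost()
    ghost.open(url)                              # load the page and run its scripts
    html, _resources = ghost.evaluate(DUMP_DOM_JS)  # assumed (result, resources) pair
    return html
```

The returned string could then be handed to BeautifulSoup4 exactly as a saved file would be, for example `BeautifulSoup(rendered_html(url))`.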
