Reading a page and parsing it with minidom.parse or minidom.parseString in Python?

Question

I have tried both of these snippets:

import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parse(res)

which gives me the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0

Or this:

import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parseString(res.read())

which gives me the same error. res.read() reads fine and is a string.

I would like to parse through the code later. How can I do this using xml.dom.minidom?

Is using `xml.dom.minidom` a requirement, or are you open to using other modules? Believe me, you should make the switch :) — alecxe, Jul 31 '14 at 23:54
The documentation for `xml.dom.minidom` does everything I need to do. I prefer to use the standard libraries unless it's really necessary to do otherwise. I saw a bunch of recommendations for `BeautifulSoup`, but I have no use for it if `xml.dom.minidom` works fine. — JVE999, Jul 31 '14 at 23:55
To parse web pages, you should use a HTML parser rather than an XML parser. — pts, Jul 31 '14 at 23:55
@JVE999 there are a lot of `BeautifulSoup` recommendations just because it really makes html-parsing easy and intuitive. It would save you time and make web-scraping fun. — alecxe, Jul 31 '14 at 23:58

score 4 · Accepted Answer · answered Jul 31 '14 at 23:55

The reason you're getting this error is that the page isn't valid XML. It's HTML 5. The doctype right at the top tells you this, even if you ignore the content type. You can't parse HTML with an XML parser.*
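To make the point concrete, here is a minimal sketch of how `minidom` reacts to HTML versus well-formed XHTML. The two sample documents and the helper name `is_well_formed_xml` are made up for illustration; they are not the actual Google response.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample pages (assumptions for illustration only).
# The HTML5 one leaves <meta>, <p> and <br> unclosed, which HTML allows
# but XML does not.
HTML5_PAGE = ('<!DOCTYPE html><html><head><meta charset="utf-8">'
              '<title>t</title></head><body><p>hi<br></body></html>')

# The XHTML one closes every element, so it is well-formed XML.
XHTML_PAGE = ('<html xmlns="http://www.w3.org/1999/xhtml"><head>'
              '<title>t</title></head><body><p>hi<br/></p></body></html>')


def is_well_formed_xml(text):
    """Return True if minidom can parse text, False on an ExpatError."""
    try:
        minidom.parseString(text)
    except ExpatError:
        return False
    return True


print(is_well_formed_xml(HTML5_PAGE))
print(is_well_formed_xml(XHTML_PAGE))
```

Only the XHTML sample gets through the XML parser; a typical HTML page fails in the same way as the code in the question.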

If you want to stick with what's in the stdlib, you can use html.parser (Python 3.x) / HTMLParser (2.x).** However, you may want to consider third-party libraries like lxml (which, despite the name, can parse HTML), html5lib, or BeautifulSoup (which wraps up a lower-level parser in a really nice interface).

* Well, unless it's XHTML, or the XML output of HTML5, but that's not the case here.

** Do not use htmllib unless you're using an old version of Python without a working HTMLParser. htmllib is deprecated for a reason.
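For the stdlib route, a rough sketch of what an `HTMLParser` subclass might look like. The class name `LinkCollector`, the helper `extract_links`, and the sample markup are all invented for this example; in the question's case you would feed it the string from `res.read()` instead (decoded to text first on Python 3).

```python
try:
    from html.parser import HTMLParser  # Python 3.x
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.x


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed html_text to a fresh LinkCollector and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample markup, with an unclosed <p> and <br> as in real HTML.
SAMPLE = ('<!doctype html><html><body><p>Results<br>'
          '<a href="/a">A</a> <a href="/b">B</a></body></html>')

print(extract_links(SAMPLE))
```

Unlike `minidom`, this is event-based: you get callbacks such as `handle_starttag` and `handle_data` rather than a tree, so you keep whatever state you need on the parser object yourself.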

Here's a SO answer describing parsing HTML with `HTMLParser` for reference: http://stackoverflow.com/questions/3276040/how-can-i-use-the-python-htmlparser-library-to-extract-data-from-a-specific-div — JVE999, Aug 01 '14 at 00:16
