etree generating error when using urllib

Question

I am trying to parse an HTML table in Python 2.7 using the solutions in this post. Either of the first two works perfectly when I pass a string, as in the example. But when I call etree.XML on an HTML page that I read with urllib, I get an error. I checked for each solution, and the variable I pass is a str in both cases. For the following code:

from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)

I get this error:

  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
    table = etree.XML(s)
  File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
  File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
  File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
  File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48

and for this code:

from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)

I get this error:

Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
    table = ET.XML(s)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111

score 0 · Accepted Answer · answered Dec 06 '15 at 18:11

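Although HTML and XML look like the same kind of markup, HTML is not as strict as XML about being well-formed (closing every opened node, escaping entities, etc.). Hence, markup that passes as HTML may be rejected by an XML parser. That is what your first traceback reports: a link tag opened on line 8 is still open when head closes.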
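The difference can be reproduced without downloading anything. The following is a minimal sketch using only the standard library; the two markup strings and the helper is_well_formed are made up for illustration and are not taken from the page in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: <link> is never closed, which HTML permits but XML does not.
html_like = ('<html><head><link rel="stylesheet" href="style.css"></head>'
             '<body><p>Hi</p></body></html>')

# Same content with <link/> self-closed, so it is well-formed XML.
xml_like = html_like.replace('href="style.css">', 'href="style.css"/>')


def is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup (illustrative helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(html_like))  # unclosed <link> triggers a mismatched-tag error
print(is_well_formed(xml_like))
```

Only the second string parses as XML, even though a browser would render both the same way.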

Therefore, consider using etree's HTML() function to parse the page. Additionally, you can use XPath to target the particular area you want to extract. Below is an example that attempts to pull the main table from the page. Note that the page uses quite a few nested tables.

from lxml import etree
import urllib.request as rq  # Python 3; on Python 2.7 use urllib.urlopen as in the question
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))

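# PARSE PAGE (the HTML parser tolerates markup that is not well-formed XML)
htmlpage = etree.HTML(s)

# XPATH TO SPECIFIC CONTENT: all text nodes inside the table that has a bold 'Rank' link in a cell
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")

# Each item is a single text node, not a whole table row
for text in htmltable:
    print(text)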
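The XPath above returns one flat list of text nodes. If you would rather have the cells grouped by row, something along these lines may work. It is only a sketch: the sample markup is a made-up stand-in for the real page (an unclosed link tag plus a nested table), and the names sample, target and rows are hypothetical.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page, not the real boxofficemojo markup.
sample = b"""<html><head><link rel="stylesheet" href="style.css"></head>
<body><table><tr><td>
<table>
<tr><td><font><a href="#"><b>Rank</b></a></font></td><td><font><a href="#"><b>Title</b></a></font></td></tr>
<tr><td>1</td><td>Film A</td></tr>
<tr><td>2</td><td>Film B</td></tr>
</table>
</td></tr></table></body></html>"""

# etree.HTML accepts the unclosed <link> that etree.XML would reject.
page = etree.HTML(sample)

# Same predicate as in the answer, but selecting the table element itself.
target = page.xpath("//table[tr/td/font/a/b='Rank']")[0]

rows = []
for tr in target.xpath("./tr"):
    # Join all text inside each cell and collapse runs of whitespace.
    cells = [" ".join("".join(td.itertext()).split()) for td in tr.xpath("./td")]
    rows.append(cells)

for cells in rows:
    print(cells)
```

Using relative paths (./tr and ./td) keeps the loop from descending into any tables nested further down, which matters on a page built from nested tables.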
