I am trying to parse an HTML table into Python (2.7) with the solutions in this post. When I try either of the first two with a string (as in the example), it works perfectly. But when I try to use etree.XML on an HTML page that I read with urllib, I get an error. I checked for each of the solutions, and the variable I pass is a str as well. For the following code:
from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)
I get this error:
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
  table = etree.XML(s)
File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48
and for this code:
from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)
I get this error:
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
  table = ET.XML(s)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
  parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
  self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
  raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111
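To narrow this down without the network, here is a minimal sketch that seems to reproduce the same kind of failure. It assumes, going by the lxml message, that the page's head contains a link tag that is never closed, which is legal HTML but not well-formed XML. The SAMPLE string and the names xml_parse_error, CellCollector and html_cells are made up for illustration and are not taken from the real page. For comparison, the sketch also feeds the same string to the standard library's lenient HTML parser, which does not insist on closing tags.

```python
from xml.etree import ElementTree as ET

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

# Hypothetical stand-in for the downloaded page: <link> is never closed,
# which is fine in HTML but not well-formed XML.
SAMPLE = (
    '<html><head><link rel="stylesheet" href="x.css"></head>'
    '<body><table><tr><td>1</td></tr></table></body></html>'
)


def xml_parse_error(text):
    """Return the ParseError message from ET.XML, or None if parsing succeeds."""
    try:
        ET.XML(text)
    except ET.ParseError as err:
        return str(err)
    return None


class CellCollector(HTMLParser):
    """Collect the text of <td> cells with the lenient stdlib HTML parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self.in_cell:
            self.cells[-1] += data


def html_cells(text):
    """Feed text to CellCollector and return the collected cell strings."""
    collector = CellCollector()
    collector.feed(text)
    collector.close()
    return collector.cells
```

If this sketch is representative, xml_parse_error(SAMPLE) returns a "mismatched tag" message like the one above, while html_cells(SAMPLE) gets through the same markup. That would suggest the problem is the strict XML parser being given HTML, not the type of the variable.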