I am trying to parse an HTML page in Python using urllib2 and ElementTree, but the parse fails. The page contains "&" inside quoted attribute values, and ElementTree throws a ParseError on the lines containing "&".
Script:
import urllib2
url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()
import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)
This throws the following error in Python 2.7:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73
The error corresponds to the following line of the page:
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
It looks like when the HTML page is read, the & sign is not parsed as &amp; in the variable r.
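To narrow this down, the failure can be reproduced without the network. Below is a minimal sketch, assuming a one-line fragment modelled on line 676 of the page; the names raw, escaped and try_parse are made up for illustration. It suggests the problem is the strict XML parser rejecting a bare "&", not anything urllib2 does to the bytes.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the failing line of the page (bare '&' in the value).
raw = '<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />'
# Same fragment with the ampersand escaped the way XML requires.
escaped = raw.replace(' & ', ' &amp; ')


def try_parse(fragment):
    """Return the 'value' attribute, or None if the fragment is not well-formed XML."""
    try:
        return ET.fromstring(fragment).get('value')
    except ET.ParseError:
        return None


raw_result = try_parse(raw)          # bare '&' is rejected by the XML parser
escaped_result = try_parse(escaped)  # '&amp;' is accepted and decoded back to '&'
```

Here raw_result is None while escaped_result holds the attribute text, which matches the behaviour seen with the full page.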
I tried parsing the same page with htmlTreeParse in R, and there "&" gets converted to &amp; properly.
Please let me know if I am missing anything in how I use urllib2.
EDIT: I replaced "&" with &amp;, but line 904 contains a < sign inside JavaScript, which throws the same error. There should be a better option than replacing characters one at a time.
LINE:904 for (i = 0; i < strac.length - 1; i++) {
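One direction that might avoid the manual replacement is a lenient HTML parser instead of a strict XML one. The following is only a sketch, assuming the standard-library HTMLParser class is acceptable; HiddenInputCollector and sample are hypothetical names, and sample is a small stand-in built from the two problem lines quoted above, not the real page.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class HiddenInputCollector(HTMLParser):
    """Collect id -> value for <input type="hidden"> tags (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.values[attrs.get('id')] = attrs.get('value')


# Stand-in for the page: a bare '&' in an attribute and a '<' inside a script.
sample = '''<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script type="text/javascript">
for (i = 0; i < strac.length - 1; i++) { }
</script>
</body></html>'''

collector = HiddenInputCollector()
collector.feed(sample)
collector.close()
```

After this runs, collector.values maps 'HdnFldAndamanNicobar' to the attribute text, and neither the "&" nor the "<" raises an error. With the real page, the content of r would be fed in place of sample (on Python 3 it would first need decoding to text).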