I've been fighting with this for an hour now. I'm parsing an XML string with iterparse. However, the data is not encoded properly, and I am not the provider of it, so I can't fix the encoding.
Here's the error I get:
lxml.etree.XMLSyntaxError: line 8167: Input is not proper UTF-8, indicate encoding !
Bytes: 0xEA 0x76 0x65 0x73
How can I simply ignore this error and continue parsing? I don't mind if one character is not saved properly; I just need the data.
Here's what I've tried, all picked up from the internet:
data = data.encode('UTF-8','ignore')
data = unicode(data,errors='ignore')
data = unicode(data.strip(codecs.BOM_UTF8), 'utf-8', errors='ignore')
Edit:
I can't show the URL, as it's a private API and involves my API key, but this is how I obtain the data:
ur = urlopen(url)
data = ur.read()
The character that causes the problem is å; I guess that ä, ö, etc. would also break it.
Here's the part where I try to parse it:
def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    print elem.xpath('title/text( )')

context = etree.iterparse(StringIO(data), tag='item')
fast_iter(context, process_element)
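For illustration only, here is a minimal sketch of the kind of workaround being asked about: drop the bytes that are not valid UTF-8 before the parser sees them, then iterate over the items. It is written in Python 3 and swaps in the standard library's xml.etree.ElementTree.iterparse for lxml so that it is self-contained. The helper names clean_utf8 and iter_titles and the sample document are made up for this example, and the approach is lossy: the offending character is simply removed.

```python
import io
import xml.etree.ElementTree as ET


def clean_utf8(raw):
    """Drop byte sequences that are not valid UTF-8 (lossy), return UTF-8 bytes."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


def iter_titles(raw):
    """Yield the <title> text of each <item>, clearing items once processed."""
    source = io.BytesIO(clean_utf8(raw))
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title")
            elem.clear()  # free the memory held by the finished item


# Made-up sample: declared as UTF-8 but containing a stray 0xEA byte.
sample = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b"<items><item><title>La Science des r\xeaves</title></item>"
    b"<item><title>Plain</title></item></items>"
)

print(list(iter_titles(sample)))
```

With the bad byte stripped, the parser no longer raises, at the cost of losing that one character in the first title.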
Edit 2:
This is what happens when I try to parse it in PHP. Just to clarify, F***ing Åmål is a drama movie =D
The file starts with <?xml version="1.0" encoding="UTF-8" ?>
Here's what I get from print repr(data[offset-10:offset+60]):
ence des r\xeaves, La</title>\n\t\t<year>2006</year>\n\t\t<imdb>0354899</imdb>\n
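A small sketch (Python 3, standard library only) of what that \xea byte looks like to a decoder. The sample bytes are copied from the repr() output above. The guess that the feed is really Latin-1 (ISO-8859-1) is an assumption, based only on 0xEA being 'ê' in that encoding; the helper name is_valid_utf8 is made up for this example.

```python
# Bytes taken from the repr() output above.
sample = b"ence des r\xeaves, La</title>"


def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xEA is a UTF-8 lead byte that must be followed by continuation bytes,
# but here it is followed by the ASCII letter 'v', so strict decoding fails.
print(is_valid_utf8(sample))

# Assumption: the data is actually Latin-1, where 0xEA is the letter 'ê'.
as_latin1 = sample.decode("latin-1")
print(as_latin1)

# Lossy alternative: drop the undecodable byte and re-encode as clean UTF-8.
cleaned = sample.decode("utf-8", errors="ignore").encode("utf-8")
print(cleaned)
```

If the Latin-1 guess holds, decoding with that codec keeps the character, whereas errors="ignore" discards it.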