I need to retrieve a lot of information from multiple XML files. I'm trying to build a web scraper, but I run into encoding problems when I strip all the namespaces (see code). The content of the XML files is written in Danish and contains the special characters "æøå".
How can I change the encoding of the printed XML data while still stripping the namespaces?
import urllib
from StringIO import StringIO
from xml.etree import ElementTree as ET
import re
url = "http://loremIpsum.co "
xmlString = urllib.urlopen(url).read() #data with namespaces
it = ET.iterparse(StringIO(xmlString))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1] # strip all namespaces
root = it.root
print root.findtext("loremIpsum/loremIpsum")
Current print output when root.findtext("loremIpsum/loremIpsum") contains the special character "ø":
u'\xd8'
Expected output:
ø
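For reference, here is a minimal sketch of one direction that might work, not a confirmed fix. It assumes the downloaded data is UTF-8 and declares that in its XML header. The function name `parse_without_namespaces`, the `SAMPLE` document, and the `item/name` path are hypothetical stand-ins for the real data. The idea is to hand the parser the raw bytes so it can apply the declared encoding, keep the result as a Unicode string, and encode explicitly only when writing output. The sketch uses `io.BytesIO` so that it also runs on Python 3.

```python
import io
from xml.etree import ElementTree as ET


def parse_without_namespaces(xml_bytes):
    """Parse raw XML bytes and drop the '{uri}' prefix from every tag.

    Hypothetical helper: it mirrors the iterparse loop from the question.
    """
    it = ET.iterparse(io.BytesIO(xml_bytes))
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip the namespace
    return it.root  # set once the iterator is exhausted


# Hypothetical sample standing in for the downloaded document.
# u'\xe6\xf8\xe5' is "æøå" written with escapes.
SAMPLE = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<root xmlns="http://example.com/ns">'
    u'<item><name>\xe6\xf8\xe5</name></item>'
    u'</root>'
).encode("utf-8")

root = parse_without_namespaces(SAMPLE)
text = root.findtext("item/name")  # Unicode string, namespace-free path
encoded = text.encode("utf-8")     # bytes for writing to a file or pipe
```

Whether the terminal then shows "æøå" depends on the console encoding, so writing `encoded` to a file opened in binary mode is probably the easier way to check the result.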