
I need to retrieve a lot of information from multiple XML files. I'm writing a web scraper, but I run into encoding problems when printing the data after stripping all the namespaces (see code). The content of the XML files is written in Danish and contains the special characters "æøå".

How can I change the encoding of the printed XML data while still stripping namespaces?

import urllib
from StringIO import StringIO
from xml.etree import ElementTree as ET

url = "http://loremIpsum.co"
xmlString = urllib.urlopen(url).read()  # raw XML data, with namespaces

it = ET.iterparse(StringIO(xmlString))

for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip the namespace from each tag
root = it.root

print root.findtext("loremIpsum/loremIpsum")

Current output when root.findtext("loremIpsum/loremIpsum") returns the special character "Ø":

u'\xd8

Expected output:

Ø
  • You forgot to describe your problem. Errors, stack traces, etc... – user590028 Apr 28 '16 at 15:43
  • it's an encoding error..? u'\xd8st != Øst – Thomas Perkov Apr 28 '16 at 15:53
  • Thomas -- I know you're new to SO. You should review how to post questions. Without the stack trace, there is no way to understand the context of the error -- for example, what line number generated the error. Take a few minutes to read the guidance and revise your question. – user590028 Apr 28 '16 at 16:12
  • I have tried to revise my question again.. I'm not getting any errors, I just need to change the encoding.. – Thomas Perkov Apr 28 '16 at 16:31
  • Please read http://stackoverflow.com/help/how-to-ask. If you are describing something that's not an error -- then explain exactly what you expect to see. – user590028 Apr 28 '16 at 16:48
  • When the content of the XML file contains "æ", "ø" and "å", I expect it to print "æ", "ø" and "å", but it doesn't. As an example, it currently prints u'\xd8 instead of Ø ... Can you explain how I change the encoding? – Thomas Perkov Apr 28 '16 at 18:39
  • http://stackoverflow.com/questions/6344853/python-unicode-in-windows-terminal-encoding-used#6349430 – user590028 Apr 28 '16 at 18:55
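
As an illustration of the distinction the thread is circling around, here is a minimal sketch, written for Python 3 rather than the Python 2 used in the question. It assumes a made-up sample document (the namespace URI and the tag names `root`, `place` and `name` are hypothetical, as is the helper `parse_without_namespaces`). It strips the namespaces the same way as the question's code, then shows that the escaped form is only the debugging representation of the string, while an explicit `encode("utf-8")` gives the bytes to write out.

```python
import io
from xml.etree import ElementTree as ET

# Hypothetical sample standing in for the downloaded XML bytes.
# "\u00d8" is the character "Ø"; the namespace URI and tag names are made up.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<ns:root xmlns:ns="http://example.com/ns">'
    '<ns:place><ns:name>\u00d8st</ns:name></ns:place>'
    '</ns:root>'
).encode("utf-8")


def parse_without_namespaces(xml_bytes):
    """Parse XML bytes and strip the '{namespace}' prefix from every tag."""
    it = ET.iterparse(io.BytesIO(xml_bytes))
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    # The iterator exposes the root element once the source is fully read.
    return it.root


root = parse_without_namespaces(SAMPLE_XML)
text = root.findtext("place/name")  # a text (Unicode) string

# Escaped debugging representation, comparable to the u'\xd8...' output.
escaped = ascii(text)
# Explicitly encoded bytes, suitable for a UTF-8 terminal or file.
encoded = text.encode("utf-8")

print(escaped)
print(encoded)
```

Under Python 2, as in the question, the equivalent step would be `print text.encode("utf-8")`; whether the characters then display correctly still depends on the encoding the terminal expects.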
