I am trying to read a gzipped file that contains XML with Unicode characters, but I'm getting an error. The code I am using is:

import gzip
import xml.dom.minidom

path = "index.mjml.gz"
gzFile = gzip.open(path, mode='r')
gzContents = gzFile.read()
gzFile.close()

unicodeContents = gzContents.encode('utf-8')
xmlContent = xml.dom.minidom.parseString(unicodeContents)
# Do stuff with xmlContent

When I run this code I get the following error (it fails on the line that starts with xmlContent):

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/minidom.pyc in parseString(string, parser)
   1922     if parser is None:
   1923         from xml.dom import expatbuilder
-> 1924         return expatbuilder.parseString(string)
   1925     else:
   1926         from xml.dom import pulldom

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/expatbuilder.pyc in parseString(string, namespaces)
    938     else:
    939         builder = ExpatBuilder()
--> 940     return builder.parseString(string)
    941 
    942 

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/expatbuilder.pyc in parseString(self, string)
    221         parser = self.getParser()
    222         try:
--> 223             parser.Parse(string, True)
    224             self._setup_subset(string)
    225         except ParseEscape:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1141336: ordinal not in range(128)

I found a similar earlier question, "Reading utf-8 characters from a gzip file in python", but I'm still getting an error.

Is there a problem with the xml parser?

(I'm using Python 2.7.?)

jlconlin
  • "Is there a problem with the xml parser?" Please post the stacktrace and let us know what line the error happens on. It's hard to tell whether the exception happens on `.decode(...)` or on `.parseString(...)`. If the error occurs on `.decode(...)` then the immediate problem is not with the XML parser. – Mike Samuel Nov 15 '11 at 22:38
  • Is `unicodeContents` supposed to be set to `gzContents.decode('utf-8')` or `gzContants.decode('utf-8')`? The spelling in the post is throwing me off, especially since the error message doesn't seem to be connected to that error at all. – Edwin Nov 15 '11 at 22:41
  • @Edwin I tried to indicate that the failure occurs on the last line—the line that starts with `xmlContent`. I'll add the remaining traceback. – jlconlin Nov 15 '11 at 22:50
  • @MikeSamuel yes, that is a typo that occurred when I was translating from my code to a simple example. As I mentioned, the error occurs on the line that starts `xmlContent`. – jlconlin Nov 15 '11 at 22:55
  • @ulidtko: Look at the paths in the traceback. No, it's not on Windows, and the OS is irrelevant. – John Machin Nov 16 '11 at 01:55
  • @ulidtko John is right, I was running on my Mac. – jlconlin Nov 16 '11 at 03:25
  • The answer to my question was actually in string formatting. It was a much simpler question than what I posted here. See http://stackoverflow.com/questions/8152820/how-to-do-string-formatting-with-unicode-emdash/8152840#8152840 for the answer. – jlconlin Nov 16 '11 at 14:16

1 Answer

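In Python 2, you can't pass a unicode string containing non-ASCII characters to xml.dom.minidom.parseString: the parser first tries to encode it with the default 'ascii' codec, which is what raises the UnicodeEncodeError in your traceback.

The input has to be an appropriately encoded byte string: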

>>> import xml.dom.minidom as xmldom
>>>
>>> source = u"""\
... <?xml version="1.0" encoding="utf-8"?>
... <root><text>Σὲ γνωρίζω ἀπὸ τὴν κόψη</text></root>
... """
>>> doc = xmldom.parseString(source.encode('utf-8'))
>>> print doc.getElementsByTagName('text')[0].toxml()
<text>Σὲ γνωρίζω ἀπὸ τὴν κόψη</text>

EDIT

Just to clarify - the stream read from the gzipped xml file should be passed directly to the parser without attempting to encode or decode it:

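import gzip
import xml.dom.minidom  # "import xml" alone does not load the minidom submodule

path = "index.mjml.gz"
gzFile = gzip.open(path, mode='rb')  # read the decompressed data as raw bytes
gzContents = gzFile.read()
gzFile.close()

# Pass the bytes straight to the parser, with no encode() or decode() call
xmlContent = xml.dom.minidom.parseString(gzContents)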

The parser will read the encoding from the xml declaration at the start of the file (or assume "utf-8" if there isn't one). It can then use this to decode the contents to unicode.
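
As a rough, self-contained sketch of that behaviour (not the asker's actual data): the snippet below builds a small gzipped document in memory and hands the decompressed bytes to the parser unchanged. The sample document, its ISO-8859-1 declaration, and the variable names are assumptions made for illustration only.

```python
import gzip
import io
import xml.dom.minidom

# Hypothetical sample document, declared and encoded as ISO-8859-1
xml_text = u'<?xml version="1.0" encoding="iso-8859-1"?><root><text>caf\xe9</text></root>'
raw_bytes = xml_text.encode('iso-8859-1')

# Compress into an in-memory buffer so no real file is needed
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode='wb') as gz_out:
    gz_out.write(raw_bytes)

# Read the compressed data back as plain bytes
buffer.seek(0)
with gzip.GzipFile(fileobj=buffer, mode='rb') as gz_in:
    gz_contents = gz_in.read()

# The parser picks the codec from the XML declaration inside the bytes
doc = xml.dom.minidom.parseString(gz_contents)
text = doc.getElementsByTagName('text')[0].firstChild.data
```

Here `text` ends up as a unicode string holding the accented character, even though the script never calls `decode` itself.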

ekhumoro
  • +1 In fact the input to ANY xml parser should be a bytestream; it's the parser's job to decode the xml bytestream using the declaration if any at the front of the bytestream. Your script doesn't need to know what the encoding is. – John Machin Nov 16 '11 at 01:49
  • So I changed my `decode` to the `encode` as @ekhumoro suggests, but I still get the same error. (To be honest, I don't completely understand the difference between encoding and decoding.) – jlconlin Nov 16 '11 at 03:31
  • @Jeremy. Sorry - I perhaps should have made things clearer. You don't need to use `encode` or `decode` - just pass the bytes read from the gzipped file (i.e. `gzContents`) directly to the xml parser; it will handle the decoding. – ekhumoro Nov 16 '11 at 03:43
  • @ekhumoro That's what I originally tried. I have further reduced the problem to printing with an emdash. I'll post a different question with a more precise example. – jlconlin Nov 16 '11 at 13:49
  • @ekhumoro, Hi, I have a similar problem with Suds, which uses Python's XML library for SOAP envelope parsing and construction. Would you please have a look at this question: http://stackoverflow.com/questions/15339141/suds-0-4-cant-handle-unicode-xml-sax-exceptions-saxparseexception-unknown , and let me know your answer based on your experience? Thanks in advance! – securecurve Mar 16 '13 at 14:09
  • @securecurve. I have no experience with Suds, so I'm afraid I probably can't help. However, having looked at the question you linked to, it seems you may have already solved your problem... – ekhumoro Mar 16 '13 at 19:54
  • @ekhumoro, I appreciate your help, my friend. The solution I posted to my question didn't solve the problem; it only turned the Unicode representation into a string representation, which allowed Suds to accept it, but it still doesn't accept the Unicode input. The problem is not directly in Suds but in the Python XML library it depends on, which can't parse the Unicode bytes I showed in my question. That's why I needed your help. Given that, how can Python's XML library parse Unicode bytes like `'\xd9\x8a\xd9\x8a\xd9\x8a'`, since the Suds issue boils down to an XML problem? Thanks again. – securecurve Mar 17 '13 at 05:36
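
The difference between encoding and decoding that comes up in the comments above can be sketched with the byte string quoted in the last comment. This is only an illustration: it assumes those bytes are UTF-8, and the surrounding `<root>` document is hypothetical rather than taken from the linked Suds question.

```python
import xml.dom.minidom

# The byte string quoted in the comment, assumed here to be UTF-8
raw = b'\xd9\x8a\xd9\x8a\xd9\x8a'

# decode: bytes -> text (three copies of U+064A)
text = raw.decode('utf-8')

# encode: text -> bytes, giving back the original byte string
round_trip = text.encode('utf-8')

# Hypothetical wrapper document; the parser receives bytes and decodes them itself
document = b'<?xml version="1.0" encoding="utf-8"?><root>' + raw + b'</root>'
doc = xml.dom.minidom.parseString(document)
parsed = doc.documentElement.firstChild.data
```

Under these assumptions, `parsed` equals `text`: the parser performs the same decoding step internally, which is why the bytes can be passed in without calling `decode` first.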