Python minidom and UTF-8 encoded XML with hash references

Question

I am having some difficulty in my home project, where I need to parse a SOAP request. The SOAP is generated with gSOAP and involves string parameters with special characters like the Danish letters "æøå".

gSOAP builds SOAP requests with UTF-8 encoding by default, but instead of sending the special characters in raw form (i.e. bytes C3 A6 for the special character "æ") it sends what I think are called character hash references (i.e. &#195;&#166;).

I don't completely understand why gSOAP does it this way, as I can see that it has marked the incoming payload as being UTF-8 encoded anyway (Content-Type: text/xml; charset=utf-8), but this is beside the point (I think).

Anyway I guess gSOAP probably is obeying transport rules, or what?

When I parse the request from gSOAP in Python with xml.dom.minidom.parseString() I get element values as unicode objects, which is fine, but the character hash references are not decoded as UTF-8 bytes. The parser unescapes the character hash references, but does not decode the resulting string afterwards. In the end I have a unicode string object whose code points are the UTF-8 byte values:

So if the string "æble" is contained in the XML, it comes like this in the request:

"&#195;&#166;ble"

After parsing the XML the unicode string in the DOM Text Node's data member looks like this:

u'\xc3\xa6ble'

I would expect it to look like this:

u'\xe6ble'

What am I doing wrong? Should I unescape the SOAP XML before parsing it, or is it somewhere else I should be looking for the solution, maybe gSOAP?

Thanks in advance.

Best regards Jakob Simon-Gaarde

score 1 · Answer 1 · answered Jan 12 '11 at 00:11


&#195;&#166;ble is actually Ã¦ble.

To get the expected Unicode string u'\xe6ble' after parsing, the string in the request should be &#230;ble.
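The difference can be checked directly with minidom. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2, where the results would be shown with a u'' prefix); the helper name text_of and the one-element documents are made up for illustration.

```python
from xml.dom import minidom

def text_of(xml_bytes):
    """Parse a tiny document and return the text of its root element."""
    doc = minidom.parseString(xml_bytes)
    return doc.documentElement.firstChild.data

# What gSOAP sent: one reference per UTF-8 byte, so two code points come out.
as_sent = text_of(b'<name>&#195;&#166;ble</name>')
# One reference per Unicode code point, as suggested above.
as_expected = text_of(b'<name>&#230;ble</name>')
# Raw UTF-8 bytes with no character references at all.
raw_utf8 = text_of('<name>\xe6ble</name>'.encode('utf-8'))

print(repr(as_sent))
print(repr(as_expected))
print(repr(raw_utf8))
```

Only the first variant yields the two-character sequence '\xc3\xa6' in front of "ble"; the other two both give the single character U+00E6.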


mzjn


&#195;&#166; is the UTF-8 representation, and it is also the encoding that gSOAP advertises in the SOAP request, so it is correct. It should not be &#230; in the request. The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode. – Jakob Simon-Gaarde Jan 12 '11 at 00:20
@Jakob Simon-Gaarde: The whole point of an HTML character reference is that you get ONE reference per Unicode codepoint -- `'&#195;&#166;'` is a nonsense. – John Machin Jan 12 '11 at 00:45
ok that is interesting, so since the UTF-8 representation of "æ" is 0xC3A6, what I should expect from the gSOAP request is actually &#50086; if it insists on sending hashes, right? – Jakob Simon-Gaarde Jan 12 '11 at 07:50
Just tested this, but it gave the exact same result. – Jakob Simon-Gaarde Jan 12 '11 at 07:56
Not right at all. I'll amplify what @mzjn (correctly) said above. LATIN SMALL LIGATURE AE is the unicode codepoint U+00E6. 0xe6 == 230, so the character reference is `&#230;`. The validity of none of this is impacted at all by whatever encoding is used in transmitting the bytes. The UTF-8 representation of that codepoint is the pair of bytes `"\xc3\xa6"`; `&#50086;` represents the codepoint U+C3A6, which is HANGUL SYLLABLE SSYEONH. What did you test?? – John Machin Jan 12 '11 at 10:34

score 0 · Answer 2 · answered Jan 11 '11 at 23:56

Some more detail about my problem. The project I am creating uses wsgi. The SOAP request is extracted using environ['wsgi.input'].read(). It always seems to return a raw string. I created a function that unescapes the character hashes:

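import re

# Matches decimal character references such as &#195;
HASH_REF = re.compile(r'&#(\d+);')

def unescape_hash_char(req):
  # Python 2: req is a raw byte string and each reference becomes one byte,
  # so chr() raises ValueError for any reference above 255.
  return HASH_REF.sub(lambda m: chr(int(m.group(1))), req)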

After doing this I parse the XML and I get the expected result.

Still, I would like to know what you think, and whether it is a good solution. Also, I wrote the function because I couldn't find one to do the job in the standard Python modules. Does such a function exist?

Best regards Jakob Simon-Gaarde

Please don't answer your own question. Edit your question and add the information there. — John Machin, Jan 12 '11 at 00:47

John Machin · Accepted Answer · 2011-01-12T01:55:03.213

Here's how to unescape such stuff: http://effbot.org/zone/re-sub.htm#unescape-html
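In the spirit of that recipe (this is not a copy of it), a minimal sketch of unescaping numeric references with re.sub and a callback could look like the following. It is written in Python 3 syntax, works on already decoded text rather than raw bytes, and the function name unescape_numeric_refs is an assumption made for illustration.

```python
import re

def unescape_numeric_refs(text):
    """Replace decimal (&#230;) and hex (&#xE6;) references in decoded text."""
    def replace(match):
        digits = match.group(1)
        if digits[0] in 'xX':
            return chr(int(digits[1:], 16))  # hexadecimal form
        return chr(int(digits))              # decimal form
    return re.sub(r'&#([xX][0-9a-fA-F]+|[0-9]+);', replace, text)

result = unescape_numeric_refs('&#230;ble and &#xE6;ble')
print(repr(result))
```

Each reference is turned into exactly one code point, so both spellings above come out as U+00E6 followed by "ble".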

However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ...

Your example character is LATIN SMALL LIGATURE AE (U+00E6). As you say, encoded in UTF-8, this is \xc3\xa6. 0xc3 == 195 and 0xa6 == 166. 0xe6 == 230. Escaping your character should produce '&#230;', not '&#195;&#166;'.

However it appears that it is encoding to UTF-8 first and then doing the escaping.
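The two orders of operation can be reproduced in a few lines. This is a sketch in Python 3 syntax of what the sender appears to be doing, not gSOAP's actual code; the variable names are made up for illustration.

```python
ch = '\xe6'  # the example character, U+00E6

# Correct: one numeric reference per Unicode code point.
correct = '&#%d;' % ord(ch)

# Wrong order: encode to UTF-8 first, then escape each byte separately.
wrong = ''.join('&#%d;' % b for b in ch.encode('utf-8'))

print(correct)  # &#230;
print(wrong)    # &#195;&#166;
```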

What you need to do is to show us in fine detail the code that you are using together with diagnostic prints (using the repr() function so that we can see the type and unambiguously-represented contents) of each str and unicode object involved in the process. Also provide the docs for the gSOAP API(s) that you are using.

On the receiving end, please show us the repr() of the raw XML that you receive.

Edit in response to this comment on another answer: """The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode."""

It (and any other XML parser) {does not, cannot in generality, and must not} unescape numerical character references or predefined character entities BEFORE decoding.

(1) unescaping "&lt;" to "<" would blow up

(2) what would you unescape "&#256;" to? "\xc4\x80"?

(3) how could it unescape at all if the encoding was UTF-16xx?
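Points (2) and (3) can be illustrated without an XML parser at all. The sketch below (Python 3 syntax, with a made-up one-element document) applies a byte-level search for numeric references to UTF-16 data and then tries to store the value 256 in a single byte.

```python
import re

byte_ref = re.compile(rb'&#(\d+);')

# (3) In UTF-16 every ASCII character is padded with a zero byte, so a
# byte-level search finds no reference until the data has been decoded.
utf16_bytes = '<a>&#230;</a>'.encode('utf-16-le')
found_in_bytes = byte_ref.findall(utf16_bytes)
found_in_text = re.findall(r'&#(\d+);', utf16_bytes.decode('utf-16-le'))
print(found_in_bytes)  # []
print(found_in_text)   # ['230']

# (2) Code point 256 does not fit in one byte, so unescaping "&#256;"
# at the byte level has no well-defined result.
try:
    bytes([256])
    fits_in_one_byte = True
except ValueError:
    fits_in_one_byte = False
print(fits_in_one_byte)  # False
```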

here is the SOAP request that gSOAP sends: http://pastebin.com/raw.php?i=9NS7vCMB — Jakob Simon-Gaarde, Jan 12 '11 at 01:20
gSOAP is a well tested framework, and probably isn't the party making the problem. — Jakob Simon-Gaarde, Jan 12 '11 at 01:21
"However it appears that it is encoding to UTF-8 first and then doing the escaping." exactly this is what I meant with: "The problem is that minidom.parseString() does not seem to unescape the character hash representation before it decodes to unicode." — Jakob Simon-Gaarde, Jan 12 '11 at 08:12
Let's disambiguate all of that: Whatever process is causing that stream of bytes to appear is INCORRECTLY encoding to UTF-8 and then doing escaping. XML parsers CORRECTLY (as I explained) must decode first then unescape second. — John Machin, Jan 12 '11 at 09:44

unutbu · Answer 4 · 2011-01-12T00:37:34.927

0

Note that

In [5]: u'æ'.encode('utf-8')
Out[5]: '\xc3\xa6'

So what we have is the unicode object u'\xc3\xa6', and what we really want is the string object '\xc3\xa6'. This transformation can be performed with the raw-unicode-escape codec:

In [1]: text=u'\xc3\xa6'
In [2]: text.encode('raw-unicode-escape')
Out[2]: '\xc3\xa6'

In [3]: text.encode('raw-unicode-escape').decode('utf-8')
Out[3]: u'\xe6'

In [4]: print(text.encode('raw-unicode-escape').decode('utf-8'))
æ
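For readers on Python 3, a comparable repair can be sketched as follows. Note that it swaps in the latin-1 codec instead of raw-unicode-escape (latin-1 maps code points 0-255 straight to bytes and raises UnicodeEncodeError for anything higher), and that the helper name repair_double_encoded is an assumption made for illustration.

```python
from xml.dom import minidom

def repair_double_encoded(text):
    """Reinterpret code points 0-255 as bytes, then decode them as UTF-8."""
    return text.encode('latin-1').decode('utf-8')

# Parse the element as gSOAP sent it, then repair the text afterwards.
doc = minidom.parseString(b'<name>&#195;&#166;ble</name>')
mangled = doc.documentElement.firstChild.data
repaired = repair_double_encoded(mangled)

print(repr(mangled))
print(repr(repaired))
```

This only makes sense for text known to be mangled in this particular way; applying it to correctly escaped input would either raise an error or corrupt the text.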

edited Jan 12 '11 at 00:37

answered Jan 12 '11 at 00:25


I am aware of this, but I shouldn't have to do manual decoding after parsing an XML document if it is valid. – Jakob Simon-Gaarde Jan 12 '11 at 10:01
It is valid. It's just rubbish, that's all. – John Machin Jan 12 '11 at 11:07

Jakob Simon-Gaarde · Answer 5 · 2011-01-12T10:02:10.377

0

Unless someone can tell me that gSOAP is not producing valid encoded SOAP XML: (see http://pastebin.com/raw.php?i=9NS7vCMB or the codeblock below) I see no other solution than to unescape character hash references before parsing the XML.

Of course as John Machin has pointed out, I cannot unescape XML control characters like "<" and ">".

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns1="urn:ShopService"><SOAP-ENV:Body SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><ns1:createCompany><company-code>DK-123</company-code><name>&#195;&#166;ble</name></ns1:createCompany></SOAP-ENV:Body></SOAP-ENV:Envelope>

/ Jakob

edited Jan 12 '11 at 10:02

answered Jan 12 '11 at 08:29


(0) [again] don't answer your own question ... (1) relevant snippet: `&#195;&#166;ble` That is highly likely to be legally valid encoded (but rather silly) SOAP XML, which an XML parser legally must turn into `u"\xc3\xa6ble"`, as expected by other users of XML parsers (2) """I see no other solution ...""" Why do you believe that unutbu's answer is not a solution? (3) You have been rather reticent about who or what is cranking the gsoap handle: is it under your control or influence or not? – John Machin Jan 12 '11 at 10:14
(2) It just seems wrong that I have to fix encodings manually on the receiver side if the XML from the sender side is valid. I would expect that if the string "æble" is transmitted in a "legal" XML request from the sender, it ends up as "æble" u'\xe6ble' on the receiver side after parsing. – Jakob Simon-Gaarde Jan 12 '11 at 10:54
Please get this through your head: It is NOT transmitting that string. It is transmitting some other rubbish. As I've explained, it's valid XML. There's no law against transmitting stuff that is valid XML but is meaningless rubbish. – John Machin Jan 12 '11 at 11:06
(3) I have control over the gSOAP C++ code that produces the SOAP XML; I can pastebin the code for you. In general I can say that the file containing the special character "æ" is saved in UTF-8. I use g++ to compile and my system (Ubuntu) is set up to da_DK.UTF-8. – Jakob Simon-Gaarde Jan 12 '11 at 11:08
You seem angry. Have I been offensive in some way? I have from the start been saying that I wasn't sure that the request XML was as it should be. I am no expert in XML escaping. So thanks, all I needed to know was that gSOAP is sending rubbish, so now I can start looking at the problem from that side – Jakob Simon-Gaarde Jan 12 '11 at 11:15
I am not angry. "from the start" you have been ignoring what I said in the second line of my answer: "However the primary problem is what you and/or this "gSOAP" (URL, please) are doing ..." and subsequent amplification; in fact 9 tenths of the answer was explaining why. I am glad that you are getting closer; "gSOAP is ..." is not quite the same as "you and/or this gSOAP is ..." but it's a start :-) – John Machin Jan 12 '11 at 11:34
I have to admit I didn't see your comment: "Let's disambiguate all of that: Whatever process is causing that stream of..." where you clearly write that it is valid XML but not UTF-8. Thanks :-) – Jakob Simon-Gaarde Jan 12 '11 at 11:48
Not clearly enough, it seems. Of course "it" is encoded in valid UTF-8; if not, the XML parser would have rejected it. The whole problem is that the creating mechanism is UTF-8-encoding first then escaping second ... it's the order that is wrong. I'll try again: Whatever process is causing that stream of bytes to appear is INCORRECTLY [bracket]encoding to UTF-8 and then doing escaping[/bracket]. XML parsers CORRECTLY (as I explained) must [bracket]decode first then unescape second[/bracket] which doesn't reverse out what the input mechanism does. – John Machin Jan 12 '11 at 19:28
in short: it is valid UTF-8 but there is no "æ" in it :-) – Jakob Simon-Gaarde Jan 13 '11 at 08:07
