python 3 Decode string staÃŸe

Question

how can I decode a string containing stuff like this:

sta&#195;&#159;e

to

staße

using python.

(EDIT: Interpreting the source as html entities does not lead to the desired result, but "staÃe")

Background: I am struggling to work with the amazon mws response-strings using the mws client you get when doing pip install mws. Especially wondering because the sourcestring looks like it contains 2 special characters, but the goal is just 'ß'.

In the docs they are talking about a Unicode character limit i did not understand

Tried [Decode HTML entities in Python string?](https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) but doesn't do the job. — Jean-François Fabre, Sep 13 '17 at 11:52
Was my first idea as well, but interpreting the escaped characters as html entities (as in your linked thread) leads to "staÃe". I tried that at http://www.convertstring.com/de/EncodeDecode/HtmlDecode — Telcrome, Sep 13 '17 at 11:56

score 3 · Accepted Answer · answered Sep 13 '17 at 12:35

3

Well, the problem here is that ß is represented in UTF-8 as the sequence of two bytes: C3 9F hex or 195 159 decimal. However, as you decode your entities as HTML, they end up as Unicode code points 195 and 159, 195 being code point for Ã. You will have to do some voodoo, like casting the str to bytes and then decoding bytes to (Unicode) str. Compare the results of:

print('\xc3\x9f')

print(bytes('\xc3\x9f', 'Latin-1').decode())

answered Sep 13 '17 at 12:35

Błotosmętek

12,717
19
29

thanks, your snippet in combination with setting utf-8 as encoding on the output xml file solved the problem – Telcrome Sep 26 '17 at 12:48

python 3 Decode string staÃŸe

1 Answers1