
In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that back from UTF-8 then gives the initial string with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

Etienne Perot
  • There's no reliable way to solve this because the input data doesn't contain enough information in the first place. – Niklas B. Mar 23 '12 at 20:48
  • All the bytes in the input data are all UTF-8-encoded characters, so I think it is safe to assume that every sequence of bytes in the initial string can be safely decoded from UTF-8 – Etienne Perot Mar 23 '12 at 20:51
  • @NiklasB. is right - the UTF-8 encoded bytes are also valid Unicode codepoints so there's no way to tell what's what reliably. – Mark Ransom Mar 23 '12 at 20:52
  • @EtiennePerot, if you're starting with a UTF-8 byte sequence then please add it to the question. What you've shown us is a Unicode string which is NOT THE SAME! – Mark Ransom Mar 23 '12 at 20:53
  • Well then, I'm not really sure what that string is anymore... It is an object whose representation starts with `u` (like unicode strings do) and which contains both `\uXXXX`'s (like unicode strings do) and `\xXX`'s (like byte strings do). All sequences of `\xXX`'s in the representation of the object also happen to be valid UTF-8 byte strings if they were byte strings (which they're not, because they're contained inside the unicode string). Not sure if that makes more sense, but I hope it does. – Etienne Perot Mar 23 '12 at 20:58
  • What’s happening is that the second part of your string has been double-encoded, which causes it to appear to have a bunch of code points < 255, which interpreted as UTF-8 give the right value. – tchrist Mar 23 '12 at 21:20
  • I think your best bet is to figure out how such a crazy string was generated in the first place. Only then can you figure out the best way to fix it. You may be able to avoid modifying the code responsible, but you probably can't avoid understanding it. – Winston Ewert Mar 23 '12 at 21:24
  • BTW, "Русский ек" doesn't seem to be valid either, it probably should read "Русский язык" (=Russian language), so I guess there's more than that broken. – georg Mar 23 '12 at 21:35
  • @tchrist indeed, if the data is indeed on disk nominally UTF-8 encoded, OP may be looking at a hopefully rare case of double-UTF. ;) – Karl Knechtel Mar 23 '12 at 21:39
  • @thg435 Nah, that's just because I took a substring of a word to keep this example string short enough (Full string was `Стандартный Захват Контрольных Точек`) – Etienne Perot Mar 23 '12 at 21:39
  • Just to be clear: the presence of `\xNN` escape sequences does not mean they're UTF-8 bytes. Python represents Unicode code points in the range 0 to FF by `\x` escape sequences (other than printable ascii characters and `\n` `\t` etc). See code [here](https://github.com/python/cpython/blob/63718f4154b0cb0e0bc672b6c38e00dfef70d111/Objects/unicodeobject.c#L6121,L6239). Try this code: `for n in range(300): print hex(n), repr(unichr(n))`. For example, [the character Ð (U+00D0)](https://codepoints.net/U+00D0) will be represented by `\xd0` rather than `\u00d0` even in a Unicode string. – ShreevatsaR Nov 27 '16 at 23:45

6 Answers


In Python 2, Unicode strings may contain both unicode and bytes:

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
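
For instance (an illustrative Python 2 session, not part of the original answer):

>>> u'\xd0' == u'\u00d0'
True
>>> repr(u'\u00d0')
"u'\\xd0'"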

There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 through as they are.
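
A minimal sketch of that idea (the fix_mixed helper below is hypothetical, not from the answer, and assumes every run of code points below 256 really is mis-read UTF-8 data):

def fix_mixed(s):
    # Walk the unicode string: collect runs of code points < 256 as raw
    # bytes, decode each finished run as UTF-8, and pass real (>= 256)
    # characters through untouched.
    out, run = [], []
    for ch in s:
        if ord(ch) < 256:
            run.append(chr(ord(ch)))   # reinterpret the code point as a byte
        else:
            if run:
                out.append(''.join(run).decode('utf-8'))
                run = []
            out.append(ch)
    if run:
        out.append(''.join(run).decode('utf-8'))
    return u''.join(out)

>>> fix_mixed(u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'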

Karl Knechtel
  • I guess I do have to make that assumption in order to get the unicode string to behave. I understand that it may break in case the assumption fails and the string should hold a < 256 code-point, but at this point I think that trusting the assumption would do more good than harm. In retrospect, kev's answer does exactly that, but I'd rather accept your answer because it explains why it is a bad idea to do this in general. Thanks~ – Etienne Perot Mar 23 '12 at 21:22
  • You can isolate high-order ASCII chars (x80-xFF) and then _try_ converting them from utf8. If this succeeds, this is most probably correct, because normal texts are unlikely to contain utf8 sequences (`î` anyone?), otherwise leave them as is. – georg Mar 23 '12 at 21:38
  • @thg435 That’s exactly what my easy Perl solution does; but for some reason in Python you have to go through a lot more hassle; see @Kev’s answers and the comments. I’m surprised the accepted answer hasn’t shown exactly how to do it. – tchrist Mar 23 '12 at 21:56
  • @tchrist: I posted an [example](http://stackoverflow.com/a/9847114/989121) of what I meant, it's more verbose than your perl snippet, but still concise. – georg Mar 23 '12 at 22:13
  • > "It just happens that the `repr` for Unicode strings in [Python] prefers to represent characters with `\x` escapes where possible" — Indeed, and [this seems to be the relevant code (as of today)](https://github.com/python/cpython/blob/b82a5a65caa5b0f0efccaf2bbea94f1eba19a54d/Objects/unicodeobject.c#L6121,L6239) in the CPython source which decides how to escape characters. Or you can just try something like: `for n in range(300): print hex(n), repr(unichr(n))` or (in [Python 3](https://www.python.org/shell/)) `for n in range(900): print(hex(n), repr(chr(n)), ascii(chr(n)))`. – ShreevatsaR Nov 28 '16 at 19:10

(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

import re

def convert(s):
    # s is a match object covering a run of code points in 0x80-0xFF;
    # latin1 maps those code points straight back to bytes, which we then
    # try to decode as UTF-8.
    try:
        return s.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        # not a valid UTF-8 sequence: leave the run as it is
        return s.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')

Result:

Рус utf:ек bytes:blää  
georg
  • Nice job, I was looking for some kind of `null` encoding like what `latin1` does, which would just return the Unicode codepoints < 256 unmodified. This is a lot more elegant than forcing a reinterpretation using the `chr(ord(c))` workaround, IMHO. – Niklas B. Mar 23 '12 at 22:14
  • Very good. Mind if I keep Karl Knechtel's answer as accepted though? I think anyone stumbling upon this question should rather be told why it is a bad idea to have these strings in the first place and why it is error-prone to try and fix them this way – Etienne Perot Mar 23 '12 at 22:20

The problem is that your string is not actually encoded in a specific encoding. Your example string:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

is a mix of Python's internal representation of unicode strings and UTF-8-encoded text. If we just consider the 'special' characters:

>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек

But, you say, bytes is utf-8 encoded:

>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек

Wrong! But what about:

>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек

Hurrah.

So, what does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us, or trying to figure out, is why your strings are in this form to begin with and how to avoid/fix it before they get all mixed up.
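
One way such a mixed string can come about (purely illustrative, not taken from the question's actual code path; the names good and bad are made up) is when part of some UTF-8 byte input gets decoded with the wrong codec, e.g. latin-1, and is then concatenated with correctly decoded text:

>>> good = '\xd0\xa0\xd1\x83\xd1\x81'.decode('utf-8')   # decoded correctly
>>> bad = '\xd0\xb5\xd0\xba'.decode('latin-1')          # decoded with the wrong codec
>>> good + u' ' + bad
u'\u0420\u0443\u0441 \xd0\xb5\xd0\xba'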

beerbajay
  • All of this is true; I know this is not right and that those strings should never be in this form at all in the first place. These strings are from a Python module written by someone else (a MediaWiki API library called wikitools). I could maybe fix that module instead of trying to handle things myself, but if there is a simple solution without having to edit that module, I'd rather go for the simple solution. – Etienne Perot Mar 23 '12 at 20:49
  • @Etienne: Problem is that there are a lot of cases which you haven't thought of (specifically cases where you can't tell if something is UTF-16 or UTF-8 encoded data) which will break the "solution" you have (that is actually only a nasty workaround). You should really consider accepting this or Karl's answer, which explains the problem in more detail. – Niklas B. Mar 23 '12 at 20:55
  • @tchrist: It's not "broken" in the way you seem to think it is. `\xB5` is just the default representation of the equivalent `\u00B5`, so what's actually broken is the heuristic "bytes < 256 must be UTF-8 encoded". – Niklas B. Mar 23 '12 at 20:58

You should convert unichrs to chrs, then decode them.

u'\xd0' == u'\u00d0' is True

$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
  • r'[\000-\377]*' will match the unichrs u'[\u0000-\u00ff]*'
  • u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
  • The string uses UTF-8-encoded bytes as Unicode code points (this is the PROBLEM)
  • I solve the problem by treating those mistaken unichars as the corresponding bytes
  • I search for all these mistaken unichars, convert them to chars, then decode them.

If I'm wrong, please tell me.

kev
  • Works indeed~ Much appreciated. – Etienne Perot Mar 23 '12 at 20:50
  • Just because a byte is in the range 0x80 to 0xff doesn't mean it's part of a UTF-8 sequence. Each of those bytes is also a valid Unicode code point, and if your string contains actual characters in that range this method will fail. – Mark Ransom Mar 23 '12 at 20:56
  • ***That’s ridiculous!*** In Perl or Java, a string that has `"\xD0"` in it is exactly the same as a string that has `"\u00D0"` or `"\x{00D0}"` in it. What a mess. That’s also why there is only one `chr` function, and no bogus `unichr` function. Doesn’t this get fixed in Python3? – tchrist Mar 23 '12 at 21:00
  • @tchrist: It's the same way in Python. The reason there are separate `chr` and `unichr` functions is that the former produces a "classical" ASCII string, while the latter produces a unicode string. In Python 3, there's no such distinction; all strings are unicode (and consequently, `unichr` no longer exists). – Niklas B. Mar 23 '12 at 21:02
  • @NiklasB Ok, I’ve just proved to myself that it is not insane. I haven’t worked with Python for a bit, and I was afraid the OP was saying something nutty was going on. This makes me even more puzzled by the OP. I think they have the wrong mental model. (BTW, Perl and Java strings are always Unicode strings.) – tchrist Mar 23 '12 at 21:09
  • @tchrist: I think OP just faces badly broken input and tries to fix the symptom, rather than the cause. Agreed that having two different types of strings is suboptimal, I think it's mainly historical reasons that play a role here. Ruby 1.8 suffered from the same problem, I wonder why they didn't make Unicode the default right from the beginning in both languages... – Niklas B. Mar 23 '12 at 21:14
  • @Kev I like your solution, but don’t you need to do this only on the high bytes? The Perl solution is just `s/([\x80-\xff]+)/decode(utf8 => $1)/ge`. Gimme a second while I translate that into Python. – tchrist Mar 23 '12 at 21:25
  • @NiklasB Please accept my apologies for being so dense at first. Last year when I was doing Python programming I used only Python3, so wasn’t sure how Python2 worked. I now see what’s actually happening; I have to go translate my Perl solution to Python for the OP. But I much agree that they really should track down the part that’s double-encoding a piece of their string and fix that, not try to undo the damage after the fact. – tchrist Mar 23 '12 at 21:31
  • @tchrist I agree, but that code is in another Python module which I'd rather not delve into (at least not right now); I asked this because I just wanted to be able to make sense of such strings as they were – Etienne Perot Mar 23 '12 at 21:44
  • @Kev Weird, I couldn’t figure out how to do it more simply than you did it; Python’s `decode` method isn’t behaving the way Perl’s does. **I wonder why?** Anyway, you do still want to use a pattern of `[\x80-\xFF]+`, though. The Perl equiv is just `perl -CS -MEncode -le 'my $s = "\x{420}\x{443}\x{441}\x{441}\x{43a}\x{438}\x{439} \xd0\xb5\xd0\xba"; $s =~ s/([\x80-\xff]+)/decode("utf-8", $1)/ge; print $s'`, which generates the expected `Русский ек`. Why can’t I call `decode` on a Unicode string in Python the way I can in Perl? Weird. – tchrist Mar 23 '12 at 21:49
  • @tchrist: You can't use `decode` on a Unicode string (it doesn't make a lot of sense, actually). What you can do is to convert it into a "binary" string using `''.join(chr(ord(c)) for c in thestring)` and then use `decode('utf-8')` on it (this is roughly what Kev does in the answer as well). Something like `re.sub(r'[\x80-\xff]+', lambda m: ''.join(chr(ord(c)) for c in m.group(0)).decode('utf-8'), s)` is what I'd try. – Niklas B. Mar 23 '12 at 21:59
  • @NiklasB Thanks; I did think about that, but I feel it actually does make sense for just this very situation here. Notice how easy it is to do that in Perl, where since all strings are Unicode strings, Perl’s `decode` necessarily takes a Unicode string argument and looks for codepoints under 256 for the decoding the way I showed above. It’s not a big deal, but the work around is a lot more tedious than it should have to be, given how annoyingly often the need to reverse a double encoding actually comes up. – tchrist Mar 23 '12 at 21:59
  • @tchrist: I think what's happening here is that decode is designed to transform binary (non-Unicode) strings into Unicode. If called with a Unicode string as an argument, it converts it to a binary string first using the default encoding (ascii, usually), which obviously fails for codepoints > 127. `s.encode('latin1').decode('utf-8')` seems like a good solution. – Niklas B. Mar 23 '12 at 22:14

You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:

  1. Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
  2. Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
  3. Decodes the sequence using UTF-8 back into Unicode.
  4. Replaces the original UTF-8-like sequence with the matching Unicode character.

Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.

import re

# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')

# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')

# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')

print repr(a)

# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
    \xF0[\x90-\xBF][\x80-\xBF]{2} |  # Valid 4-byte sequences
        [\xF1-\xF3][\x80-\xBF]{3} |
    \xF4[\x80-\x8F][\x80-\xBF]{2} |

    \xE0[\xA0-\xBF][\x80-\xBF]    |  # Valid 3-byte sequences
        [\xE1-\xEC][\x80-\xBF]{2} |
    \xED[\x80-\x9F][\x80-\xBF]    |
        [\xEE-\xEF][\x80-\xBF]{2} |

    [\xC2-\xDF][\x80-\xBF]           # Valid 2-byte sequences
    ''')

def replace(m):
    return m.group(0).encode('latin1').decode('utf8')

print
print repr(p.sub(replace,a))

Output:

u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'

u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'

Mark Tolonen

I solved it by

unicodeText.encode("utf-8").decode("unicode-escape").encode("latin1")
Tahirhan