Convert GBK to utf8 string in python

Question

I have a string.

s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"

How can I convert s into a UTF-8 string? I tried s.decode('gbk').encode('utf-8'), but Python reports an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 35-50: ordinal not in range(128)

When you add the `u` prefix, Python already treats your string as a Unicode string. – RedX, Apr 16 '14 at 08:08
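The error in the question can be reproduced explicitly. As far as I understand it, Python 2 first encodes a unicode object with the default ASCII codec when `.decode()` is called on it, which is why an *encode* error shows up. A minimal Python 3 sketch of that hidden step, with a shortened string and illustrative variable names:

```python
# Python 3 sketch. The \x escapes in a text literal are the code points
# U+00C7, U+00EB, ... and not raw bytes.
s = "alert('\xc7\xeb\xca\xe4');"

try:
    # Python 2 ran this step implicitly before .decode('gbk') on a
    # unicode object; here it is spelled out.
    s.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc  # the first non-ASCII character sits right after "alert('"
```

The failure happens before GBK is ever involved, so changing the codec passed to `.decode()` would not help.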

score 6 · Answer 1 · answered Apr 16 '14 at 08:53

In Python 2, try this to convert your Unicode string:

>>> s.encode('latin-1').decode('gbk')
u"<script language=javascript>alert('\u8bf7\u8f93\u5165\u6b63\u786e\u9a8c\u8bc1\u7801,\u8c22\u8c22!');location='index.asp';</script></script>"

Then you can encode the result to UTF-8:

>>> s.encode('latin-1').decode('gbk').encode('utf-8')
"<script language=javascript>alert('\xe8\xaf\xb7\xe8\xbe\x93\xe5\x85\xa5\xe6\xad\xa3\xe7\xa1\xae\xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81,\xe8\xb0\xa2\xe8\xb0\xa2!');location='index.asp';</script></script>"
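For readers on Python 3, the same round trip can be sketched as below. This assumes the mis-decoded text is held in a `str`; the string is shortened to the alert call and the names `raw`, `text` and `utf8` are only illustrative.

```python
# Python 3 sketch of the latin-1 round trip.
s = ("alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,"
     "\xd0\xbb\xd0\xbb!');")

raw = s.encode("latin-1")    # code points 0-255 back to the original bytes
text = raw.decode("gbk")     # interpret those bytes as GBK
utf8 = text.encode("utf-8")  # bytes object holding the UTF-8 encoding
```

In Python 3 the final result is a `bytes` object; `text` is the value to keep if the goal is simply correct text.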

hyu163

The detour over `latin-1` is shocking. Yes, it's a workaround, but that is really not how you do it. – tripleee Aug 22 '14 at 09:36
@tripleee not shocking at all once you know the mechanics behind it. Unicode used Latin-1 as its base for the first 256 codepoints, so if you need those codepoints as bytes it's a 1:1 mapping. Obviously it's better to get the decoding done properly in the first place, but sometimes with [Mojibake](https://en.wikipedia.org/wiki/Mojibake) that's impossible. – Mark Ransom Nov 30 '16 at 18:37
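The 1:1 mapping mentioned in the comment above can be checked directly. A small Python 3 sketch (variable names are illustrative):

```python
# Every possible byte value, decoded as Latin-1, yields the code point
# with the same number, and encoding reverses it without loss.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

code_points = [ord(ch) for ch in as_text]
round_trip = as_text.encode("latin-1")
```

Because the mapping is lossless in both directions, `encode('latin-1')` recovers the exact bytes that were mistakenly read as code points.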

score 3 · Answer 2 · answered Aug 22 '14 at 09:32

You are mixing apples and oranges. The GBK-encoded string is not a Unicode string and should hence not end up in a u'...' string.

This is the correct way to do it in Python 2.

g = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,' \
    '\xd0\xbb\xd0\xbb!'.decode('gbk')
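# Parentheses allow the expression to continue on the next line;
# the single quotes around the alert text match the original string.
s = (u"<script language=javascript>alert('" + g +
     u"');location='index.asp';</script></script>")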

Notice how the initializer for g, which is passed to .decode('gbk'), is not written as a Unicode string but as a plain byte string.

See also http://nedbatchelder.com/text/unipain.html
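In Python 3 the same separation is explicit in the literal syntax: a `b'...'` literal holds bytes and a plain literal holds text. A minimal sketch of this answer's approach under that assumption (`utf8_bytes` is an illustrative name):

```python
# Python 3 sketch: the GBK data starts out as bytes, never as text.
g = (b"\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,"
     b"\xd0\xbb\xd0\xbb!").decode("gbk")

s = ("<script language=javascript>alert('" + g +
     "');location='index.asp';</script></script>")

utf8_bytes = s.encode("utf-8")  # encode only at the output boundary
```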

Ivaylo · Answer 3 · 2014-04-16T09:20:09.367

If you can keep the alert in a separate string "a":

a = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!'.decode("gbk")
s = u"<script language=javascript>alert('"+a+"');location='index.asp';</script></script>"
print s

Then it will print:

<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>

If you want to decode the whole string in one go (the ASCII markup passes through GBK decoding unchanged):

s = "<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
# Splitting on "'" and re-joining the decoded text is a no-op, so a plain decode is enough.
s = s.decode("gbk")
print s

This will print:

<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
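Decoding the whole byte string works because GBK leaves single bytes below 0x80 as plain ASCII. A Python 3 sketch with a shortened string, which also shows the standard `errors` argument of `bytes.decode` for input that was cut off mid-character (the `damaged` sample is made up for illustration):

```python
# Python 3 sketch: decode markup and GBK text together.
raw = b"<script>alert('\xc7\xeb\xca\xe4\xc8\xeb');</script>"
text = raw.decode("gbk")

# Hypothetical damaged input: the last two-byte character lost its second byte.
damaged = b"\xc7\xeb\xca"
try:
    damaged.decode("gbk")          # strict mode rejects the incomplete sequence
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable part instead of raising.
lenient = damaged.decode("gbk", errors="replace")
```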

@amazingjxq: Note that in the second method the string s is a plain byte string, not s=u''. – Ivaylo, Apr 16 '14 at 09:12

score -1 · Answer 4 · answered Apr 16 '14 at 08:38

Take a look at `unicodedata`; I think one way to do this is:

import unicodedata

s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
unicodedata.normalize('NFKD', s).encode('utf-8','ignore')
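As a quick comparison, here is a Python 3 check on the first two characters of the string. It suggests that normalization alone leaves the text as accented Latin letters, whereas re-interpreting the bytes as GBK yields the Chinese text; the variable names are illustrative.

```python
import unicodedata

# First two mis-decoded characters from the question's string.
mojibake = "\xc7\xeb\xca\xe4"

# NFKD only decomposes the accented letters (e.g. C + combining cedilla).
normalized = unicodedata.normalize("NFKD", mojibake)

# The latin-1/GBK round trip re-reads the underlying bytes instead.
repaired = mojibake.encode("latin-1").decode("gbk")
```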

s16h

Could the down-voter please explain the down-vote so I can learn too? Thanks. – s16h Aug 25 '14 at 12:28

score -1 · Answer 5 · answered Aug 22 '14 at 09:20

I had the same question.

For example:

name = u'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'

I want to convert it to

u'\u53e4\u5251\u5947\u8c2d'

Here is my solution:

new_name = name.encode('iso-8859-1').decode('gbk')

I also tried it on your string:

s = u"alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';"

print s

alert('ÇëÊäÈëÕýÈ·ÑéÖ¤Âë,Ð»Ð»!');location='index.asp';

Then:

_s = s.encode('iso-8859-1').decode('gbk')

print _s

alert('请输入正确验证码,谢谢!');location='index.asp';

Hope this helps.
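The same repair can be wrapped in a small helper. This is only a sketch for Python 3: `fix_mojibake` and its parameter names are hypothetical, and it assumes the text was mis-decoded as Latin-1 (ISO-8859-1) when it was really GBK.

```python
def fix_mojibake(text, wrong="latin-1", right="gbk"):
    """Re-interpret text that was decoded with the wrong codec.

    Returns the input unchanged if the round trip is not possible,
    for example when the text contains characters outside Latin-1.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


name = "\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7"
new_name = fix_mojibake(name)

# Text that is already correct cannot be encoded as Latin-1, so it is left alone.
unchanged = fix_mojibake(new_name)
```

One caveat: genuine Latin-1 text whose bytes happen to form valid GBK sequences would also be converted, so a helper like this is best applied only to data known to be damaged.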
