Python crawler returns garbled text that seems to mix multiple encodings

Question

I got a location u'\u0107\x9d\xad\u013a\u02c7\x9e\u013a\xb8\x82', which actually should be '\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82'. How can I decode something like this?

How did you create that location Unicode string? Why do you believe it should be `'\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82'`? What encoding(s) are you using? I presume you're using Python 2, but what OS are you using? FWIW, if we presume that your 2nd string is UTF-8, it decodes to `'杭州市'`, which is `'\u676d\u5dde\u5e02'` using Unicode escape sequences. — PM 2Ring, Feb 11 '17 at 08:22
I got that string from a crawler, and I can see the result on their original page, which is '杭州市'. Yes, I'm using Python 2.7, and I got that string on both Mac and CentOS 7. There is something really strange: if I visit that URL in Chrome, it shows the correct result, '杭州市', but if I open Chrome's dev tools, it shows "æå·žå¸‚" — lingeng, Feb 11 '17 at 08:39
That crawler code is broken, or not configured correctly. And you still didn't mention which encoding(s) you're using. You may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. — PM 2Ring, Feb 11 '17 at 09:37

McGrady · Answer 1 · 2017-02-11T10:28:22.997

0

I suggest you read up on Unicode in Python 2.7.

`u'\u0107\x9d\xad\u013a\u02c7\x9e\u013a\xb8\x82'` does not equal `'\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82'`, so I suppose there is something wrong with your crawler code.
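That said, the two strings look related. One possible explanation, which is a guess and not something the question confirms, is that the crawler decoded the page's UTF-8 bytes as ISO-8859-2 (Latin-2). Under that assumption, the damage can be reversed by encoding the garbled text back to bytes with the wrong codec and then decoding those bytes as UTF-8. A minimal sketch:

```python
# Assumption: the crawler decoded UTF-8 bytes as ISO-8859-2 (Latin-2).
garbled = u'\u0107\x9d\xad\u013a\u02c7\x9e\u013a\xb8\x82'

# Undo the wrong decode to recover the original bytes.
raw = garbled.encode('iso-8859-2')

# Decode the recovered bytes with the codec the page actually used.
fixed = raw.decode('utf-8')

print(repr(raw))
print(fixed == u'\u676d\u5dde\u5e02')
```

If the crawler used a different codec, the `encode` step may raise `UnicodeEncodeError` or produce other bytes, so the more robust fix is still to make the crawler decode the response as UTF-8 in the first place.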

In Python 2.x, you should be careful with encoding problems. In Python 2 we have two text types: `str`, which for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range, and `unicode`, which is equivalent to the Python 3 `str` type. There is also one byte type, `bytearray`, which it inherited from Python 3.
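To make the difference between text and bytes concrete, here is a small sketch that runs on both Python 2.7 and Python 3. The variable names are only for illustration, and the sample value is the city name from the question:

```python
# Text: a sequence of Unicode code points (unicode on Python 2, str on Python 3).
text = u'\u676d\u5dde\u5e02'

# Bytes: what actually travels over the network (str on Python 2, bytes on Python 3).
raw = text.encode('utf-8')

# bytearray exposes the individual byte values as integers on both versions.
first_bytes = list(bytearray(raw))[:3]

print('%d code points, %d bytes' % (len(text), len(raw)))
```

The same three characters occupy nine bytes in UTF-8, which is why decoding those bytes with a single-byte codec yields nine wrong characters.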

Python 2 provides a migration path from non-Unicode to Unicode by allowing coercion between byte strings and non-byte strings. You can check out More About Unicode in Python 2 and 3.
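When Python 2 coerces a byte string to `unicode`, it decodes it with the default codec, which is ASCII unless it has been changed. The sketch below spells out that hidden step explicitly so it behaves the same on Python 2.7 and Python 3; the variable names are illustrative only:

```python
raw = b'\xe6\x9d\xad'  # UTF-8 bytes for a single Chinese character

try:
    # Roughly what Python 2 attempts behind the scenes in u'' + raw.
    raw.decode('ascii')
    outcome = 'decoded'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

# Naming the codec explicitly avoids relying on the default.
explicit = raw.decode('utf-8')

print(outcome)
```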

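Also, you can add this at the start of your script; it sets the system default encoding to UTF-8. It's useful for testing a program, although it only changes the default codec Python 2 uses when converting between `str` and `unicode`, so it will not repair a string that was already decoded with the wrong codec.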

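    import sys
    # Python 2 only: site.py deletes sys.setdefaultencoding at startup,
    # and reload(sys) brings it back.
    reload(sys)
    sys.setdefaultencoding('utf-8')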

As a matter of fact, I don't suggest that programmers use this in a large program. It might trigger other issues.
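One alternative that avoids changing the global default is to decode bytes explicitly at the point where they enter the program. The helper below is a hypothetical sketch, not part of any library; its name `to_text` and its UTF-8 default are assumptions chosen for illustration:

```python
def to_text(value, encoding='utf-8'):
    """Return text: decode byte strings explicitly, pass text through unchanged."""
    # bytes is an alias of str on Python 2, so this check works on both versions.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical usage: bytes as they might arrive from an HTTP response body.
body = b'\xe6\x9d\xad\xe5\xb7\x9e\xe5\xb8\x82'
city = to_text(body)
print(city == u'\u676d\u5dde\u5e02')
```

Because the codec is named at each call site, the rest of the program can work with text only, and no other module is affected by a changed default.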

Encoding problems in Python 2.x are really discouraging, and if you want to avoid them, you should seriously consider switching to Python 3.

Hope this helps.

edited Feb 11 '17 at 10:28

answered Feb 11 '17 at 08:15

McGrady


Thanks for your answer. But I want to know how to decode that string to '杭州市', not "ćĺˇĺ¸" – lingeng Feb 11 '17 at 08:53
Coding directive comments do not affect the system default encoding. They merely tell the Python interpreter which encoding was used to create the file containing the script, IOW, they only affect the decoding of the script itself, they have no effect on the external data that the script reads or writes. – PM 2Ring Feb 11 '17 at 09:31
@PM2Ring [setdefaultencoding](https://docs.python.org/2/library/sys.html#sys.setdefaultencoding) – McGrady Feb 11 '17 at 09:34
That `setdefaultencoding` hack isn't a good idea. Please see [Dangers of sys.setdefaultencoding('utf-8')](http://stackoverflow.com/q/28657010/4014959) and http://stackoverflow.com/a/28127538/4014959 – PM 2Ring Feb 11 '17 at 09:41
@PM2Ring Many people discuss that, and yes, it might harm the program. I will edit my answer, but on the other hand it's very useful for testing code. – McGrady Feb 11 '17 at 10:10
