Some problems with a Python crawler

Question

I'm struggling with a problem in my Python crawler.

First, the websites use two different hexadecimal representations of Chinese characters. I can convert one of them (E4BDA0E5A5BD), but I have no method to convert the other (C4E3BAC3), or maybe I am missing some method. Both hexadecimal values represent '你好' in Chinese.

Second, I found a website that can convert the hexadecimal, and to my surprise its output is exactly the form I cannot convert myself.

The url is http://www.uol123.com/hantohex.html

That raised a question: how do I get the result shown in the text box (I don't know exactly what it is called)? I used Firefox with HttpFox to inspect the POST request, and I found that the result converted by the website is in the Content. Here is the pic:

When I print the POST request, it shows the POST data and some headers, but no info about the Content.

Third, I googled how to use Ajax and found some code that shows how to simulate an Ajax request.

Here is the URL: http://outofmemory.cn/code-snippet/1885/python-moni-ajax-request-get-ajax-request-response. But when I run the code, it fails with "ValueError: No JSON object could be decoded."

Sorry, I am new here, so I cannot post images.

I am looking forward to your help sincerely.

Any help will be appreciated.

Can you supply the url to the website you're trying to read? — ToonAlfrink, May 20 '14 at 15:18
If you cannot post images, then maybe you should remove the reference to a picture that is not there ? — logc, May 20 '14 at 15:39
Furthermore, it is really difficult to follow what you are asking exactly. Could you please rephrase the question into a clearly stated question, like: How do you parse these two Chinese symbols out of a website's Content? — logc, May 20 '14 at 15:40

score 0 · Answer 1 · answered May 20 '14 at 15:59

You're talking about different encodings for these Chinese characters. There are at least three widely used ones: Guobiao (mainland China), Big5 (Taiwan), and Unicode (everywhere else, usually serialized as UTF-8).

Here's how to convert your characters into the different encodings:

>>> a = u'你好'             # your original characters
>>> a
u'\u4f60\u597d'            # Unicode code points
>>> a.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd' # in UTF-8
>>> a.encode('big5')
'\xa7A\xa6n'               # in Taiwanese Big5
>>> a.encode('gb2312-80')
'\xc4\xe3\xba\xc3'         # in Guobiao
>>>

You may check other available encodings here.

Ah, I almost forgot: to convert from Unicode into a specific encoding, use the encode() method. To convert the encoded contents of the web site back to Unicode, use the decode() method. Just don't forget to specify the correct encoding.
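As a rough illustration of the decode() direction applied to the two hex strings from the question, here is a minimal sketch. It assumes Python 3 (the session above is Python 2), and the helper names `hex_to_text` and `text_to_hex` are made up for this example, not part of any library.

```python
# Python 3 sketch: convert the hex strings from the question to text and back.
# hex_to_text and text_to_hex are hypothetical helper names for illustration.

def hex_to_text(hex_string, encoding):
    """Turn a hex string such as 'C4E3BAC3' into text using the given encoding."""
    raw = bytes.fromhex(hex_string)   # hex digits -> raw bytes
    return raw.decode(encoding)       # raw bytes -> str


def text_to_hex(text, encoding):
    """Encode text with the given encoding and return uppercase hex digits."""
    return text.encode(encoding).hex().upper()


utf8_hex = "E4BDA0E5A5BD"   # the value the asker could already convert
gb_hex = "C4E3BAC3"         # the value the asker could not convert

print(hex_to_text(utf8_hex, "utf-8"))
print(hex_to_text(gb_hex, "gb2312"))
print(text_to_hex("你好", "gb2312"))
```

Both hex strings should come out as the same two characters; only the encoding passed to decode() differs.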

Well, I can run the first two commands, but the results are different from yours, and the last two raise errors. >>> a=u'你好' >>> a u'\xc4\xe3\xba\xc3' >>> a.encode('utf-8') '\xc3\x84\xc3\xa3\xc2\xba\xc3\x83' — Windsor_Gu, May 21 '14 at 02:33
u'\xc4\xe3\xba\xc3' is not Unicode; it's some kind of Windows encoding, most likely Guobiao (see above, the bytes are the same). You should do something like `a = '你好'.decode('gb2312-80')` to get **real** Unicode. — lenik, May 21 '14 at 09:33
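
The situation in the comments above (Guobiao bytes that ended up as individual code points in a Unicode string) can be sketched in Python 3 as follows. This is only an illustration under the assumption that the original bytes were GB2312; `repair` is a hypothetical helper name.

```python
# Python 3 sketch: recover text whose GB2312 bytes were misread one byte per
# character (the u'\xc4\xe3\xba\xc3' value from the comment above).
# repair is a hypothetical helper; GB2312 as the real encoding is an assumption.

def repair(garbled, real_encoding="gb2312"):
    """Map each code point back to a single byte, then decode properly."""
    raw = garbled.encode("latin-1")    # U+00C4 -> byte 0xC4, and so on
    return raw.decode(real_encoding)


garbled = "\xc4\xe3\xba\xc3"
print(repr(garbled.encode("utf-8")))   # the unexpected UTF-8 bytes from the comment
print(repair(garbled))
```

The latin-1 round trip works because that codec maps code points 0 to 255 directly to the same byte values, so no information is lost before the real decode.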
