How to print out text contents that contain Chinese characters in Python?

Question

I'm using Python to extract the contents of a web page. The HTML I'm interested in contains some Chinese characters along with ordinary ASCII characters.
When I print an HTML tag and its content, the printed text is garbled, as shown below:

<h4>绔彛:443</h4>
<h4>A瀵嗙爜:</h4>
<h4>鍔犲瘑鏂瑰紡:aes-256-cfb</h4>

The original content is as follows:

<h4>端口:443</h4>
<h4>A密码:</h4>
<h4>加密方式:aes-256-cfb</h4>

Could you please help me print the correct content in the console? I'm using Python 2.7. The code snippet is shown below:

Update 1:
After trying Shiva's suggestion of using lxml, I got the result shown in the capture below:

Update 2:

How can I display the original Chinese characters in the Git Bash console?
Thank you in advance!

Best regards,
Junma

score 0 · Answer 1 · edited May 23 '17 at 11:52

>>> print u'加密方式'.encode('utf-8').decode('gbk')
鍔犲瘑鏂瑰紡

Your console is configured to handle GBK. Configure it to handle UTF-8 instead.
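
The demonstration above can be run in both directions. The following is a minimal sketch, not part of the original answer: it assumes the garbling comes purely from UTF-8 bytes being read as GBK, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the garbled text and then undo it.
# Assumption: the only problem is UTF-8 bytes being decoded as GBK.

original = u'加密方式'

# UTF-8 bytes misread as GBK: this produces the garbled text.
garbled = original.encode('utf-8').decode('gbk')

# Reversing the two steps recovers the original, provided every
# byte pair happened to form a valid GBK sequence.
restored = garbled.encode('gbk').decode('utf-8')
```

The reversal is only a diagnostic aid. It shows which pair of encodings is involved; the durable fix is still to make the producer and the console agree on one encoding.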

answered Jul 01 '16 at 04:28

Ignacio Vazquez-Abrams

Hi Ignacio, thank you for your reply. I've attached my code snippet. Since the content I want to print comes from a variable, "h4", I don't think I can use your u' ' approach, am I right? Besides, my console, Git Bash, displays Chinese characters correctly. – cmjauto Jul 01 '16 at 04:38
That wasn't a solution, that was a demonstration. The solution is in the second part of my answer. – Ignacio Vazquez-Abrams Jul 01 '16 at 04:48
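
To see which encoding Python believes the console uses, a small check like the one below may help. It is a sketch under assumptions: the helper name `describe_encodings` is made up for illustration, and whether Git Bash reports its real character set to Python is something to verify on the machine in question. `PYTHONIOENCODING` is a standard Python environment variable that overrides the encoding of the standard streams.

```python
# -*- coding: utf-8 -*-
# Sketch: report the encodings Python will use when printing.
from __future__ import print_function

import locale
import os
import sys


def describe_encodings(stream=None):
    """Return the encodings relevant to printing text (hypothetical helper)."""
    stream = sys.stdout if stream is None else stream
    return {
        # Encoding attached to the output stream; may be None when piped.
        'stream': getattr(stream, 'encoding', None),
        # Encoding implied by the system locale (GBK-family on Chinese Windows).
        'locale': locale.getpreferredencoding(),
        # Explicit override, if the user has set one.
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
    }


if __name__ == '__main__':
    for name, value in sorted(describe_encodings().items()):
        print('%s: %s' % (name, value))
```

If the stream encoding turns out to be a GBK variant while the terminal actually renders UTF-8, one option to try is starting the script with `PYTHONIOENCODING=utf-8` set in the shell.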

shiva · Accepted Answer · 2016-07-01T11:07:53.153

0

You can try:

soup=BeautifulSoup(html, "lxml", from_encoding='utf-8')

You can find the page's encoding in the page info shown by Firefox or Chrome.
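
When a browser is not at hand, the declared encoding can often be read from the page's own meta tag. The following is a rough sketch using only the standard library; `sniff_charset` is a hypothetical helper, and it assumes the page declares a charset near the top of the document, which not every page does.

```python
# -*- coding: utf-8 -*-
# Sketch: read the charset a page declares in its <meta> tag.
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
_META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_charset(raw_html, limit=4096):
    """Return the declared charset in lower case, or None if none is found."""
    match = _META_CHARSET.search(raw_html[:limit])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()
```

The result could then be passed as the `from_encoding` argument, keeping in mind that a declared charset is a claim made by the page and can be wrong.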

EDIT:

from bs4 import BeautifulSoup
import requests

url = "http://www.cnblogs.com/rollenholt/archive/2011/08/01/2123889.html"
# Use .content (raw bytes): from_encoding is ignored when the markup
# is already unicode, which is what .text returns.
html = requests.get(url).content

soup = BeautifulSoup(html, "lxml", from_encoding='utf-8')

lst = soup.find_all('span')

for h in lst:
    print h.string  # or you could do print h

I get the output below when I run it.

edited Jul 01 '16 at 11:07

answered Jul 01 '16 at 05:32

shiva

Hi Shiva, it works better with the lxml library. However, I ran into another issue: the printed output is shown as `[u'\u52a0\u5bc6\u65b9\u5f0f:aes-256-cfb']`. How can I make it display the original Chinese characters? – cmjauto Jul 01 '16 at 06:34
Either print the individual elements of the list, or do something like what is described [here](http://stackoverflow.com/questions/20947173/printing-unicode-char-inside-a-list). – shiva Jul 01 '16 at 07:09
Hi Shiva, the method in the link you proposed does not work for some special characters. Python raises the error: `UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 10: illegal multibyte sequence.` – cmjauto Jul 01 '16 at 08:36
Did you try printing them individually? And can you share the exact code? – shiva Jul 01 '16 at 09:11
Hi Shiva, please check the code above; it's not suitable for pasting into the comment window. You can also try it on your side. Thank you! – cmjauto Jul 01 '16 at 10:45
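
The two problems raised in these comments (a list printing as `u'\u...'` escapes, and `u'\xa0'` not being encodable in GBK) can be handled together. The sketch below is an illustration, not something taken from the thread: `to_console_text` is a hypothetical helper, and it assumes that replacing unencodable characters is acceptable for console output.

```python
# -*- coding: utf-8 -*-
# Sketch: turn a list of unicode strings into text a GBK console can show.


def to_console_text(items, encoding='gbk'):
    """Join unicode items into text that `encoding` can represent."""
    # Printing a list in Python 2 shows the repr of each element
    # (the u'\u52a0...' form); joining yields one plain unicode string.
    text = u'\n'.join(items)
    # A non-breaking space (u'\xa0') has no GBK mapping; use a normal space.
    text = text.replace(u'\xa0', u' ')
    # Anything else the encoding cannot represent becomes '?'.
    return text.encode(encoding, 'replace').decode(encoding)


# Example value taken from the comment above, with a trailing u'\xa0' added.
sample = to_console_text([u'\u52a0\u5bc6\u65b9\u5f0f:aes-256-cfb\xa0'])
```

In use, the result would be printed with the console's own encoding passed in, for example `sys.stdout.encoding` when it is set.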
