I'm using Python to extract the contents of a web page. The HTML I'm interested in contains Chinese characters mixed with ordinary ASCII characters.
When I print an HTML tag and its content, the text comes out garbled, as shown below:

<h4>绔彛:443</h4>
<h4>A瀵嗙爜:</h4>
<h4>鍔犲瘑鏂瑰紡:aes-256-cfb</h4>

The original content is as follows:

<h4>端口:443</h4>
<h4>A远端:</h4>
<h4>加密方式:aes-256-cfb</h4>

Could you please help me print the correct content in the console? I'm using Python 2.7. The code snippet is shown below:
[Image: code snippet]

Update 1:
After trying Shiva's suggestion to use lxml, I got the result shown in the capture below:
[Image: console output after switching to lxml]

Update 2:
[Image: second update]

Could you please tell me how to display the original Chinese characters in the Git Bash console?
Thank you in advance!

Best regards,
Junma

cmjauto

2 Answers

>>> print u'加密方式'.encode('utf-8').decode('gbk')
鍔犲瘑鏂瑰紡

Your console is configured to handle GBK. Configure it to handle UTF-8 instead.
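
To make the demonstration above concrete, here is a minimal sketch of the round trip: text encoded as UTF-8 and then misread as GBK turns into the garbled form, and reversing the two steps recovers it. The helper names `garble` and `ungarble` are hypothetical, chosen only for illustration. The last line shows the encoding Python assumes for the console. One common way to change it, assuming the terminal itself renders UTF-8, is to set the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python.

```python
# -*- coding: utf-8 -*-
import sys


def garble(text, wrong_encoding="gbk"):
    """Encode text as UTF-8, then misread those bytes as wrong_encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def ungarble(text, wrong_encoding="gbk"):
    """Reverse garble(): recover the UTF-8 bytes and decode them correctly."""
    return text.encode(wrong_encoding).decode("utf-8")


# The four characters from the demonstration above, written as escapes.
original = u"\u52a0\u5bc6\u65b9\u5f0f"
garbled = garble(original)

# The garbling is reversible for this string, which confirms the diagnosis.
assert ungarble(garbled) == original

# Encoding Python uses when printing to this console (None when output is piped
# on Python 2).
print(sys.stdout.encoding)
```

If the last line prints something like `cp936` or `gbk` while the terminal expects UTF-8 (or the reverse), that mismatch is the likely cause of the garbled output.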

Ignacio Vazquez-Abrams
  • Hi Ignacio, thank you for your reply. I've attached my code snippet. Since the content I want to print comes from a variable, "h4", I don't think I can use your u' ' approach, am I right? Besides, my console, Git Bash, can display Chinese characters correctly. – cmjauto Jul 01 '16 at 04:38
  • That wasn't a solution, that was a demonstration. The solution is in the second part of my answer. – Ignacio Vazquez-Abrams Jul 01 '16 at 04:48

You can try:

soup=BeautifulSoup(html, "lxml", from_encoding='utf-8')

You can find the page's encoding in the page info shown by Firefox or Chrome:
[Image: browser page info showing the encoding]
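
The declared encoding can also be read from the markup itself. The following is only a rough sketch using the standard library: the helper names `declared_charset` and `decode_html` are hypothetical, the regular expression only covers the two common `<meta>` forms, and it ignores the HTTP `Content-Type` header, which a real client should check first.

```python
# -*- coding: utf-8 -*-
import re

# Matches charset=... in either <meta charset="utf-8"> or
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
_CHARSET_RE = re.compile(
    br"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def declared_charset(html_bytes, default="utf-8"):
    """Return the charset named in a <meta> tag, or default if none is found."""
    match = _CHARSET_RE.search(html_bytes[:4096])  # only scan the start of the page
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_html(html_bytes):
    """Decode raw page bytes using the charset the page declares."""
    return html_bytes.decode(declared_charset(html_bytes), "replace")
```

The value returned by `declared_charset` is the kind of name one would pass as `from_encoding`, provided the markup is handed to BeautifulSoup as raw bytes.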

EDIT:

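from bs4 import BeautifulSoup
import requests

url = "http://www.cnblogs.com/rollenholt/archive/2011/08/01/2123889.html"
# Pass the raw bytes (.content); from_encoding is ignored for already-decoded text (.text)
html = requests.get(url).content

soup = BeautifulSoup(html, "lxml", from_encoding='utf-8')

# Collect every <span> element on the page
lst = soup.find_all('span')

for h in lst:
    print(h.string)  # or print(h) to include the tag itself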

I get the output below when I run it.
[Image: console output]

shiva
  • Hi Shiva, it works better with the lxml library. However, I've hit another issue: the output is now printed as `[u'\u52a0\u5bc6\u65b9\u5f0f:aes-256-cfb']`. How can I make it display the original Chinese characters? – cmjauto Jul 01 '16 at 06:34
  • You can either print the individual elements of the list, or do something like what is described [here](http://stackoverflow.com/questions/20947173/printing-unicode-char-inside-a-list). – shiva Jul 01 '16 at 07:09
  • Hi Shiva, the method in the link you suggested doesn't work for some special characters. Python raises the error: `UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 10: illegal multibyte sequence.` – cmjauto Jul 01 '16 at 08:36
  • Did you try printing them individually? And can you share the exact code? – shiva Jul 01 '16 at 09:11
  • Hi Shiva, please check the code above; it's not suitable for pasting into a comment. You can also try it on your side. Thank you! – cmjauto Jul 01 '16 at 10:45
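
The two problems raised in these comments can be sketched together. Printing a list in Python 2 shows the escaped repr of each element, so joining the elements into one string first avoids the `\u...` escapes. The `UnicodeEncodeError` arises because GBK has no no-break space (`u'\xa0'`), so one option is to replace unencodable characters before printing. The helper names `join_items` and `make_encodable` are hypothetical, the second list item is made-up sample data, and replacing with `?` is just one possible policy.

```python
# -*- coding: utf-8 -*-


def join_items(items, sep=u", "):
    """Join text items into one string, so printing shows the characters
    themselves rather than the escaped repr of a list."""
    return sep.join(items)


def make_encodable(text, encoding="gbk"):
    """Replace characters the target encoding lacks (for example a
    no-break space in GBK) with '?', and return the result as text."""
    return text.encode(encoding, "replace").decode(encoding)


# First item is the string from the comment above; second is made-up sample data.
items = [u"\u52a0\u5bc6\u65b9\u5f0f:aes-256-cfb", u"port\xa0443"]

# Safe to print on a console whose encoding is GBK.
line = make_encodable(join_items(items))
```

Calling `print(line)` on a GBK console should then show the Chinese characters, with the no-break space shown as `?` instead of raising an error.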