Chinese character encoding error with BeautifulSoup in Python?

Question

I'd like to use BeautifulSoup to get the data in a table from a website, but it doesn't handle the Chinese characters correctly. This is my code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.515fa.com/che_1978.html").read()
soup=BeautifulSoup(html,from_encoding="UTF-8")
print soup.prettify()

And the Chinese characters are displayed like this:

<td align="center" bgcolor="#FFFFFF" u1:str="" width="173">
               ćé¸</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="149">
               ä¸ćľˇĺ¤§äź</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="126">
               ĺ¤§äź</td>

I really don't know what "ä¸ćľˇĺ¤§äź" is. I tried changing the encoding from "utf-8" to "gb18030", but it didn't work. How can I get the correct Chinese characters? Thanks!

What are you outputting this HTML *to*? The browser? The console? — deceze, Aug 24 '15 at 07:27
What's the encoding used by the terminal output? You may need to do something like `print soup.prettify().encode('gb18030')` or something like that. — Bakuriu, Aug 24 '15 at 07:58
Is your terminal configured to use UTF-8? Or is it using gb18030? The page has a `` line. I don't have any fonts that are adequate to display Chinese properly in my terminal, but that page _does_ appear to be UTF-8 encoded. (FWIW, there's no encoding info supplied in the page's headers, and the Requests module _guesses_ the encoding to be ISO-8859-1, but that's not unusual.) — PM 2Ring, Aug 24 '15 at 08:07
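The garbled text itself hints at which encoding the output was displayed in. The sketch below (Python 3, standard library only) is an inference from the characters shown in the question, not something the thread confirms: it assumes the cell contained 上海大众 and that its UTF-8 bytes were rendered as ISO-8859-2. Under those assumptions it reproduces the string from the question once the invisible control characters are dropped.

```python
# Assumption: the original cell text was this string, encoded as UTF-8.
original = "上海大众"

# UTF-8 bytes read back with the wrong codec (assumed here: ISO-8859-2).
garbled = original.encode("utf-8").decode("iso-8859-2")

# Drop the invisible C1 control characters (U+0080 to U+009F) that a
# terminal or web page would normally not display.
visible = "".join(ch for ch in garbled if not 0x80 <= ord(ch) <= 0x9F)

# Reversing the wrong step recovers the text, provided no bytes were lost.
restored = garbled.encode("iso-8859-2").decode("utf-8")

print(visible == "ä¸ćľˇĺ¤§äź")   # compare with the text shown in the question
print(restored == original)
```

If that reading is right, the bytes on the page were valid UTF-8 all along and only the display step used a different encoding.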

score 2 · Accepted Answer · answered Aug 24 '15 at 07:57

Try:

html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)

I'm not sure what exactly BeautifulSoup(from_encoding=) did, but decoding the response to Unicode before passing it to BeautifulSoup did the trick.
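For anyone on Python 3, the same idea (decode the bytes yourself, then hand text to the parser) might look roughly like the sketch below. It uses only the standard library. `decode_html` and its default encoding list are hypothetical names made up for illustration, not part of BeautifulSoup, and the byte string stands in for what `urlopen(...).read()` would return.

```python
def decode_html(raw, encodings=("utf-8", "gb18030")):
    """Decode raw bytes with the first encoding that fits without errors.

    Hypothetical helper for illustration; the encoding list is an assumption.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort, mirroring the answer above: drop undecodable bytes.
    return raw.decode(encodings[0], "ignore")


# Stand-in for the bytes returned by urlopen(...).read()
page = "<td>上海大众</td>".encode("utf-8")
text = decode_html(page)

# The same helper falls back to gb18030 when the bytes are not valid UTF-8.
gb_text = decode_html("<td>上海大众</td>".encode("gb18030"))
print(text == gb_text)
```

The resulting `text` is an ordinary Unicode string, so it can be passed straight to `BeautifulSoup(text)` without `from_encoding`.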

esfy

Yes! It works! Thank you very much! I've been working on this for 4 hours! :-) – Shawn Aug 24 '15 at 08:07
