
I'd like to use BeautifulSoup to get the data in a table from a website, but it doesn't grab the Chinese characters correctly. This is my code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.515fa.com/che_1978.html").read()
soup=BeautifulSoup(html,from_encoding="UTF-8")
print soup.prettify()

And the Chinese characters are displayed like this:

<td align="center" bgcolor="#FFFFFF" u1:str="" width="173">
               ćé¸</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="149">
               ä¸ćľˇĺ¤§äź</td>
<td align="center" bgcolor="#FFFFFF" u1:str="" width="126">
               大äź</td>

I really don't know what "ä¸ćľˇĺ¤§äź" is. I tried changing the encoding from "utf-8" to "gb18030", but it didn't work. How can I get the correct Chinese characters? Thanks!

Shawn
  • What are you outputting this HTML *to*? The browser? The console? – deceze Aug 24 '15 at 07:27
  • @deceze the Terminal.app on MacBook. – Shawn Aug 24 '15 at 07:33
  • What's the encoding used by the terminal output? You may need to do something like `print soup.prettify().encode('gb18030')` or something like that. – Bakuriu Aug 24 '15 at 07:58
  • Is your terminal configured to use UTF-8? Or is it using gb18030? The page has a `` line. I don't have any fonts that are adequate to display Chinese properly in my terminal, but that page _does_ appear to be UTF-8 encoded. (FWIW, there's no encoding info supplied in the page's headers, and the Requests module _guesses_ the encoding to be ISO-8859-1, but that's not unusual.) – PM 2Ring Aug 24 '15 at 08:07

1 Answer


Try:

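import urllib2
from bs4 import BeautifulSoup
# Decode the raw bytes to unicode yourself, then hand the text to BeautifulSoup
response = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = response.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)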

I'm not sure what exactly BeautifulSoup(from_encoding=) did here, but decoding the bytes as UTF-8 before passing them to BeautifulSoup did the trick.
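For what it's worth, the garbage in the question looks like UTF-8 bytes that were decoded with a single-byte Latin-2 codec somewhere along the way. The sketch below reproduces that under two assumptions of mine that are not confirmed by the page: that the cell originally contained "上海大众", and that the wrong codec was ISO-8859-2. It is written for Python 3 (unlike the Python 2 code above) and needs no network access.

```python
# Assumption: the table cell originally held this text (a guess, not taken from the page)
original = "上海大众"

# The bytes a UTF-8 encoded page would contain for that text
raw = original.encode("utf-8")

# Assumption: the bytes were decoded with ISO-8859-2, so every byte becomes one character
garbled = raw.decode("iso-8859-2")

# Drop the invisible control characters, which a terminal would not show
visible = "".join(ch for ch in garbled if ch.isprintable())

# Undo the wrong decoding, then decode the same bytes as UTF-8
recovered = garbled.encode("iso-8859-2").decode("utf-8")

print(visible)
print(recovered)
```

If that is what happened, nothing was wrong with the page itself: the bytes were fine and only the decoding step picked the wrong codec, which would explain why decoding explicitly as UTF-8 helps.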

esfy