
I am using the following code to scrape a webpage that contains Japanese characters:

import urllib2
import bs4
import time

url = 'http://www.city.sapporo.jp/eisei/tiiki/toban.html'

pagecontent = urllib2.urlopen(url)
soup = bs4.BeautifulSoup(pagecontent.read().decode("utf8"))

print(soup.prettify())
print(soup)

On some machines the code works fine, and the last two statements print the result successfully. On others, however, the second-to-last statement, print(soup.prettify()), raises

UnicodeEncodeError: 'ascii' codec can't encode characters in position 485-496: ordinal not in range(128)

and the last statement prints squares in place of all the Japanese characters.

Why does the same code behave differently on the two machines? How can I fix this?

Python version: 2.6.6

bs4 version: 4.1.0

shapeare
  • You are printing *Unicode data* and Python needs to encode that to match your Python terminal or console encoding. You'll need to fix your terminal to tell Python correctly what codecs it accepts. Currently it tells Python that only ASCII will do. What is the console or terminal you are using? – Martijn Pieters Dec 21 '14 at 15:53
  • @MartijnPieters I am using CentOS's default terminal. – shapeare Dec 21 '14 at 15:55
  • Then set the `LANG` environment variable, see http://www.cl.cam.ac.uk/~mgk25/unicode.html. – Martijn Pieters Dec 21 '14 at 15:59

1 Answer


You need to configure your environment locale correctly; once your locale is set, Python will pick it up automatically when printing to a terminal.

Check your locale with the locale command:

$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

Note the .UTF-8 in my locale settings; it tells programs running in the terminal that my terminal uses the UTF-8 codec, one that supports all of Unicode.
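To illustrate why the codec matters, here is a minimal sketch (not part of the original answer) that tests whether a piece of text can be encoded with a given codec. The helper name `can_encode` and the sample string are hypothetical; the sample is written with escapes so the snippet does not depend on the source file's encoding.

```python
def can_encode(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Three Japanese characters ("Sapporo City"), written as escapes.
sample = u"\u672d\u5e4c\u5e02"

# ASCII only covers code points 0-127, so this is the failure that
# print runs into when Python believes the terminal is ASCII-only.
print(can_encode(sample, "ascii"))

# UTF-8 can represent every Unicode code point.
print(can_encode(sample, "utf-8"))
```

When printing to a terminal, Python performs the same encoding step with whatever codec the locale advertises, which is why a locale without `.UTF-8` leads to the `'ascii' codec can't encode` error.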

You can set all of your locale in one step with the LANG environment variable:

export LANG="en_US.UTF-8"

for a US locale (how dates and numbers are printed) with the UTF-8 codec. To be precise, the output codec is taken from the LC_CTYPE setting, which falls back to the LANG value when it is not set explicitly; LC_ALL, if set, overrides both.
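To see what Python actually picked up from these settings, something like the following sketch may help. The function names `describe_output_encoding` and `canonical_codec_name` are hypothetical helpers introduced here for illustration; the calls they wrap (`sys.stdout.encoding`, `locale.getpreferredencoding()`, `codecs.lookup()`) are standard library.

```python
import codecs
import locale
import os
import sys


def canonical_codec_name(name):
    """Return Python's canonical name for an encoding, or None if unknown."""
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def describe_output_encoding():
    """Collect the settings that decide how print encodes unicode text."""
    stdout_encoding = getattr(sys.stdout, "encoding", None)
    try:
        preferred = locale.getpreferredencoding()
    except (locale.Error, ValueError):
        # A malformed locale setting can make the lookup itself fail.
        preferred = None
    return {
        "LANG": os.environ.get("LANG"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "stdout_encoding": stdout_encoding,
        "stdout_codec": canonical_codec_name(stdout_encoding),
        "preferred_encoding": preferred,
    }


if __name__ == "__main__":
    for key, value in sorted(describe_output_encoding().items()):
        print("%s = %r" % (key, value))
```

If `stdout_codec` comes back as `ascii` (a reported encoding of `ANSI_X3.4-1968` is just the formal name for ASCII), the locale settings did not reach Python in a form it recognises. Note that `stdout_encoding` can be `None` on Python 2 when output is piped rather than sent to a terminal.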

Also see the very comprehensive UTF-8 and Unicode FAQ for Unix/Linux (http://www.cl.cam.ac.uk/~mgk25/unicode.html).

Martijn Pieters
  • I just checked my machine that runs Mac OS X, which prints the result correctly. Its `locale` output is: `LANG=`, `LC_COLLATE="C"`, `LC_CTYPE="UTF-8"`, `LC_MESSAGES="C"`, `LC_MONETARY="C"`, `LC_NUMERIC="C"`, `LC_TIME="C"`, `LC_ALL=` – shapeare Dec 21 '14 at 16:20
  • @shapeare: yes, because the output encoding is taken from `LC_CTYPE`. – Martijn Pieters Dec 21 '14 at 16:22
  • I added the line `export LANG="ja_JP.UTF-8"` to `~/.bashrc` and then ran `source ~/.bashrc` so it takes effect. But the problem persists; it still says `'ascii' codec can't encode characters` – shapeare Dec 21 '14 at 16:41
  • @shapeare: The traceback is still pointing to the `print` statement? What does `import sys; print sys.stdout.encoding` show Python detected? – Martijn Pieters Dec 21 '14 at 16:43
  • The traceback still points to the print statement. `sys.stdout.encoding` prints out `ANSI_X3.4-1968` – shapeare Dec 21 '14 at 16:47
  • @shapeare: that clearly is not the right codec you configured. What does `locale` (in the shell) say `LC_CTYPE` is set to? In Python `import os; print os.environ.get('LC_CTYPE'), os.environ.get('LANG')` is interesting too, as is `import locale; print locale.getdefaultlocale()`. – Martijn Pieters Dec 21 '14 at 16:52
  • `os.environ.get('LC_CTYPE')` prints out `UTF-8`, `os.environ.get('LANG')` prints out `ja-JP.UTF-8` and `locale.getdefaultlocale()` raises an error `unknown locale: UTF-8` – shapeare Dec 21 '14 at 17:00
  • @shapeare: interesting, same symptoms as https://github.com/iElectric/almir/issues/59 – Martijn Pieters Dec 21 '14 at 17:05
  • @shapeare: the error indicates that your `_locale.so` module wasn't being loaded; what does `import _locale` do for you? Without that module Python has no UTF-8 codec to encode. – Martijn Pieters Dec 21 '14 at 17:08
  • Now I have changed to use `export LC_ALL=en_US.UTF-8` and `export LANG=en_US.UTF-8`. The error at the print statement has disappeared, but it still prints squares for each Japanese character. Adding `import _locale` to my code doesn't introduce any error. – shapeare Dec 21 '14 at 17:14
  • @shapeare: is your terminal actually set up for UTF-8 as well? Telling Python to print UTF-8 bytes is just one step in the chain. – Martijn Pieters Dec 21 '14 at 17:17
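
One way to separate the Python side of that chain from the terminal side is to look at the bytes Python would emit. The sketch below is an illustration, not something from the thread; `utf8_hex` is a hypothetical helper that encodes text as UTF-8 and returns the bytes as hex.

```python
import binascii


def utf8_hex(text):
    """Encode text as UTF-8 and return the resulting bytes as a hex string."""
    return binascii.hexlify(text.encode("utf-8")).decode("ascii")


# Two Japanese characters ("Sapporo"), written as escapes.
sample = u"\u672d\u5e4c"

print(utf8_hex(sample))  # e69cade5b98c
```

If the bytes reaching the terminal match this kind of sequence (three bytes per character for these code points), Python is doing its part, and squares on screen more likely point to the terminal emulator's own encoding setting or to a font without Japanese glyphs. That last part is an assumption about the setup, not something the thread confirms.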