
I have bytes data that includes the Japanese yen symbol (¥), which appears to be encoded as \xc2\xa5.

However, I can't decode the yen symbol. For example,

import chardet

yen = b"\xc2\xa5"
type(yen)  # returns bytes
yen.decode("utf-8")  # fails with UnicodeEncodeError: 'ascii' codec can't encode character '\xa5'
chardet.detect(yen)  # returns {'confidence': 0.73, 'encoding': 'windows-1252'}
yen.decode("windows-1252")  # fails with another UnicodeEncodeError: 'ascii' codec can't encode characters

The rest of my bytes data decodes as UTF-8 without problems. Only the Japanese yen symbol fails, no matter which encoding I use.

So how can I decode it?

Blaszard
  • I cannot reproduce your problem. `b"\xc2\xa5"` is the UTF-8 encoding of `¥`. `yen.decode("utf-8")` yields `¥` (or `u'\xa5'`) in both Python 2 and Python 3 for me, on both Debian and OS X. What is your `sys.stdout.encoding`? It looks like it could be an error in output, rather than in the string manipulation itself. – Amadan Dec 07 '16 at 06:25
  • @Amadan Oh, really? I double-checked it but still got the same error. I use Python 3.5 (IPython interactive shell) on macOS. – Blaszard Dec 07 '16 at 06:27
  • `YEN=yen.decode('utf-8')` works fine for me. – ABcDexter Dec 07 '16 at 06:28
  • @Amadan It got me `'US-ASCII'`. – Blaszard Dec 07 '16 at 06:29
  • Try running as `PYTHONIOENCODING=UTF-8 ipython`? – Amadan Dec 07 '16 at 06:30
  • Try [this](http://stackoverflow.com/a/31137935/3209112) also :) – ABcDexter Dec 07 '16 at 06:34
  • @Blaszard See the accepted answer there; it explains how to update .bash_profile. – ABcDexter Dec 07 '16 at 06:38
  • So what is your `LANG`? Empty? `zh_CN.UTF-8`? Something else? – Amadan Dec 07 '16 at 06:39
  • @ABcDexter @Amadan Sorry, the problem was in my Terminal and shell settings, not in Python. My `LANG` setting was `en_US.UTF-8`, but I hadn't added the `export` keyword. `echo $LANG` printed the correct value (`en_US.UTF-8`), but for some reason `sys.stdout.encoding` returns `UTF-8` only when you add the `export` keyword in `~/.zprofile`. Thanks for the instructions, though. Everything works fine now. – Blaszard Dec 07 '16 at 06:53
  • @Blaszard Consider answering your own question, since it might be helpful for future readers ;) – Right leg Dec 07 '16 at 07:15

1 Answer


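The problem came from the settings in Terminal and the shell, not from the decoding itself. The bytes decode correctly; the UnicodeEncodeError is raised when the resulting string is written to a stdout that only accepts ASCII. For the output to work as expected, sys.stdout.encoding should return UTF-8.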
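To see the difference between decoding and printing, something like the following may help. It is only a sketch: the helper `printable_on_stdout` is a hypothetical name introduced here for illustration, not part of any library.

```python
import sys

yen_bytes = b"\xc2\xa5"                # UTF-8 encoding of U+00A5 (¥)
yen_text = yen_bytes.decode("utf-8")   # decoding does not depend on the terminal

# ascii() escapes non-ASCII characters, so this prints safely
# even when stdout is configured as US-ASCII.
print(ascii(yen_text))
print("stdout encoding:", sys.stdout.encoding)


def printable_on_stdout(text, encoding=None):
    """Hypothetical helper: can `text` be encoded with `encoding`?

    Falls back to stdout's encoding, then to ASCII, when none is given.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print("yen printable on this stdout:", printable_on_stdout(yen_text))
```

If the last line reports False, the failure is on the output side, which matches the UnicodeEncodeError (rather than UnicodeDecodeError) in the question.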

If you don't get UTF-8, check the $LANG variable. In my case echo $LANG printed en_US.UTF-8, but because the assignment in my ~/.zprofile lacked the export keyword, the variable was not passed on to Python, and sys.stdout.encoding returned US-ASCII instead of UTF-8. So set the following in your ~/.zprofile (or ~/.bash_profile):

export LC_ALL="en_US.UTF-8"
export LANG="en_US.UTF-8"

And now you should get UTF-8 from sys.stdout.encoding.
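A shell variable that is not exported is not inherited by child processes, so echo $LANG can show a value that Python never sees. One way to check what actually reaches the interpreter is a small report like the one below. This is a rough sketch; `locale_report` is a hypothetical helper name, and the set of variables it inspects is an assumption about what is relevant here.

```python
import locale
import os
import sys


def locale_report(environ=os.environ):
    """Hypothetical helper: collect the locale values this process can see."""
    return {
        "LANG": environ.get("LANG"),        # None if the variable was not exported
        "LC_ALL": environ.get("LC_ALL"),
        "LC_CTYPE": environ.get("LC_CTYPE"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": sys.stdout.encoding,
    }


for name, value in locale_report().items():
    print("{}: {}".format(name, value))
```

Run it from a fresh shell after editing the profile file. If LANG shows None here while echo $LANG prints a value, the variable is set in the shell but not exported.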

For more information on setting a correct locale in your shell and Terminal on macOS, check out the following question.

Blaszard