I have a simple script that I'm trying to use to automate some of the Japanese translation I do for my job.

 import sys

 import requests

 # The API key is redacted; requests URL-encodes the query parameters.
 base_url = 'https://www.googleapis.com/language/translate/v2'
 params = {'key': 'CANT_SHARE_THAT', 'source': 'ja', 'target': 'en', 'q': sys.argv[1]}
 print(sys.argv[1])
 response = requests.get(base_url, params=params)
 if response.status_code != 200:
     # Stop here instead of trying to parse an error response.
     sys.exit("Error on request")
 print(response.json()['data']['translations'][0]['translatedText'])

When the first argument is a string like 初期設定クリア this script will explode at line

 print(sys.argv[1])

With the message:

 line 5, in encode
 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
 UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-6: character maps to <undefined>

So the bug can be reduced to

 import sys
 print(sys.argv[1])

Which seems like an encoding problem. I'm using Python 3.5.1, and the terminal is MINGW64 under Windows7 x64.
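
To narrow down an encoding problem like this, it can help to check which encodings Python has picked for its output streams. The following is a minimal diagnostic sketch; the helper name `describe_io_encodings` is made up for illustration, and the values it reports depend entirely on the machine and terminal it runs in.

```python
import locale
import sys


def describe_io_encodings():
    """Return the encodings Python reports for output and the locale."""
    return {
        # getattr guards against stdout replacements that lack .encoding
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    # ascii() escapes every non-ASCII character, so these prints cannot
    # raise UnicodeEncodeError whatever the console code page is.
    for name, value in describe_io_encodings().items():
        print(name, "=", ascii(value))
    if len(sys.argv) > 1:
        print("argv[1] =", ascii(sys.argv[1]))
```

If `argv[1]` comes back as the expected `\u` escapes, the argument reached Python intact and only the output side is at fault.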

When I write the same program in Rust 1.8 (and the executable is run under the same conditions, i.e. MINGW64 under Windows 7 x64)

  use std::env;

  fn main() {
      // Skip the program name and collect the remaining arguments.
      let args: Vec<String> = env::args().skip(1).collect();
      print!("First arg: {}", &args[0]);
  }

It produces the proper output:

  $ rustc unicode_example.rs
  $ ./unicode_example.exe 初期設定クリア
  First arg: 初期設定クリア

So I'm trying to understand what is happening here. MINGW64 claims to have proper UTF-8 support, which it appears to. Does Python 3.5.1 not have full UTF-8 support? I was under the impression that the move to Python 3.x was motivated by Unicode support.

Valarauca
  • related: [Python, Unicode, and the Windows console](http://stackoverflow.com/q/5419/4279) – jfs Apr 28 '16 at 15:13
  • Ignore Mingw64 - the issue is just that the Windows terminal doesn't natively support full Unicode - See http://stackoverflow.com/questions/36236066/how-to-read-text-copied-from-web-to-txt-file-using-python/36241365#36241365 – Alastair McCormack Apr 29 '16 at 08:44

1 Answer

Changing

 print(sys.argv[1])

to

 print(sys.argv[1].encode("utf-8"))

causes Python to print the string as a bytes literal:

 $ python google_translate.py 初期設定クリア
 b'\xe5\x88\x9d\xe6\x9c\x9f\xe8\xa8\xad\xe5\xae\x9a\xe3\x82\xaf\xe3\x83\xaa\xe3\x82\xa2'

Nonetheless, it works. So the bug, if it is a bug, happens when Python is decoding the internal string to print it to the terminal, not when the argument is being encoded into a Python string.
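
The failure can be reproduced without a terminal at all. This is a minimal sketch that assumes the console code page is `cp1252`; that is a guess on my part, since the traceback only says `'charmap'`, and any single-byte Windows code page without Japanese characters should behave the same way. The helper `try_encode` is made up for illustration.

```python
TEXT = "初期設定クリア"


def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# Assumption: cp1252 stands in for the console code page.
legacy = try_encode(TEXT, "cp1252")
utf8 = try_encode(TEXT, "utf-8")

if __name__ == "__main__":
    # ascii() keeps the output printable on any console.
    print("cp1252:", ascii(str(legacy)))
    print("utf-8: ", ascii(utf8))
```

The string itself is fine; only the conversion to a code page that has no Japanese characters fails, which is consistent with the error appearing in `print` and nowhere else.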

Simply removing the print statement also avoids the error.
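
If the text still needs to be shown, one possible workaround is to bypass the text layer and write UTF-8 bytes to the underlying binary stream. This is only a sketch: `write_utf8` is a hypothetical helper, and whether the characters render correctly depends on the terminal interpreting the bytes as UTF-8, which I am assuming for the MINGW64 terminal because the Rust program's output displays properly there.

```python
import sys


def write_utf8(text, stream=None):
    """Write text plus a newline as UTF-8 bytes, skipping the stream's own encoding."""
    if stream is None:
        # sys.stdout.buffer is the binary stream underneath sys.stdout.
        stream = sys.stdout.buffer
    stream.write(text.encode("utf-8") + b"\n")
    stream.flush()


if __name__ == "__main__":
    if len(sys.argv) > 1:
        write_utf8(sys.argv[1])
```

Setting the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python is another option that leaves the script unchanged.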

Valarauca
  • you use a wrong terminology: `python` **encodes** Unicode strings to bytes using a console codepage while printing (decoding is the opposite direction: bytes -> Unicode). If you want to display in Windows console Unicode characters that can't be represented using your console codepage then read my answer to [the question I've linked above (before you've posted your answer)](http://stackoverflow.com/questions/36917921/python-3-5-not-handling-unicode-input-from-cli-argument#comment61399374_36917921) – jfs Apr 29 '16 at 15:10
  • I wasn't displaying to the windows console. I was displaying to a Mingw64 console. Or does the windows build of python not care about the difference? – Valarauca Apr 29 '16 at 15:50
  • I know nothing about "Mingw64 console". Have you tried to follow instructions in [my answer](http://stackoverflow.com/a/32176732/4279)? What are the results? – jfs Apr 29 '16 at 15:54
  • Your answer confirms the last sentence of my previous comment in this thread. If you can submit a link to your answer on the other question I'll accept it as a solution – Valarauca Apr 29 '16 at 16:43
  • it is a bit cryptic. Could you say explicitly "[your answer](http://stackoverflow.com/a/32176732/4279) works for me" or "it doesn't work: when I do X from [your answer](http://stackoverflow.com/a/32176732/4279); Y is displayed in the console but I expect Z". – jfs Apr 29 '16 at 16:49