16

Here is a little program:

import sys

f = sys.argv[1]
print type(f)
print u"f=%s" % (f)

Here is my running of the program:

$ python x.py 'Recent/רשימת משתתפים.LNK'
<type 'str'>
Traceback (most recent call last):
  File "x.py", line 5, in <module>
    print u"f=%s" % (f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 7: ordinal not in range(128)
$ 

The problem seems to be that Python treats sys.argv[1] as an ASCII string, which it can't convert to Unicode. But I'm using a Mac with a fully Unicode-aware Terminal, so x.py is actually receiving a Unicode string. How do I tell Python that sys.argv[] is Unicode and not ASCII? Failing that, how do I convert ASCII (that has Unicode inside it) into Unicode? The obvious conversions don't work.

jfs
vy32
  • 1
    possible duplicate? http://stackoverflow.com/questions/846850/how-to-read-unicode-characters-from-command-line-arguments-in-python-on-windows – Ben Feb 25 '11 at 04:39
  • Ben, this question is Mac-specific insofar as Unicode is concerned, although it certainly touches on some of the same concepts. – mkelley33 Feb 25 '11 at 04:44

5 Answers

21

The UnicodeDecodeError you see is caused by mixing the Unicode string u"f=%s" with the sys.argv[1] bytestring. Make both operands the same type:

  • both bytestrings:

      $ python2 -c'import sys; print "f=%s" % (sys.argv[1],)' 'Recent/רשימת משתתפים'
    

    This passes bytes transparently from/to your terminal. It works for any encoding.

  • both Unicode:

      $ python2 -c'import sys; print u"f=%s" % (sys.argv[1].decode("utf-8"),)' 'Rec..
    

    Here you should replace 'utf-8' with the encoding your terminal uses. You might use sys.getfilesystemencoding() here if the terminal is not Unicode-aware.

Both commands produce the same output:

f=Recent/רשימת משתתפים

In general you should convert bytestrings that you consider to be text to Unicode as soon as possible.
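
A minimal sketch of that advice, assuming the terminal sends UTF-8. The helper name `to_text` and its default encoding are illustrative assumptions, not part of the original answer. It decodes once at the program boundary and leaves already-decoded text alone, so the same code runs on Python 2 and Python 3:

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding it if it is a bytestring.

    `encoding` is an assumption: pass whatever your terminal actually uses.
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value


# The UTF-8 bytes a terminal would send for 'Recent/' plus five Hebrew letters.
# Byte 7 is 0xd7, the byte the traceback in the question complains about.
raw = b"Recent/\xd7\xa8\xd7\xa9\xd7\x99\xd7\x9e\xd7\xaa"
text = to_text(raw)

# In a script you would write: f = to_text(sys.argv[1])
message = u"f=%s" % (text,)  # both operands are Unicode, so no implicit decode
```

After decoding, `len(raw)` counts bytes while `len(text)` counts code points, which is one reason to decode early.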

jfs
  • Actually, I figured out the problem. It turns out that Python regards utf-8 not as Unicode, but as ASCII. Try `print type(u"foobar".encode('utf-8'))` and you'll get `str` and not type `unicode`. – vy32 Feb 25 '11 at 22:22
  • 5
    @vy32: `'utf-8'` is a character encoding. It is *not* Unicode in any context. `.encode()` method can convert Unicode string (text) into bytestring (data). You have some misconceptions about what Unicode is. Please, read http://www.joelonsoftware.com/articles/Unicode.html – jfs Feb 25 '11 at 23:56
  • Thanks. I have, in fact, read that. The problem is the author's assertion "The Single Most Important Fact About Encodings---It does not make sense to have a string without knowing what encoding it uses." In my area of work it is quite common to have strings without knowing the encoding that they use. We also see strings that change encodings in the middle. – vy32 Feb 27 '11 at 04:09
  • 1
    @vy32: if you don't know the encoding then the input may be ambiguous e.g., ["Bush hid the facts"](https://en.wikipedia.org/wiki/Bush_hid_the_facts) and [Garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out) – jfs Oct 22 '15 at 09:24
  • 1
    `sys.getfilesystemencoding()` is a lifesaver. Thank you! – Alexander Revo Jan 30 '16 at 10:35
5
sys.argv = map(lambda arg: arg.decode(sys.stdout.encoding), sys.argv)

or you can pick encoding from locale.getdefaultlocale()[1]
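
One caveat: `sys.stdout.encoding` can be `None` (for example, when output is piped on Python 2), and then the one-liner above fails. Here is a hedged sketch with a fallback chain. The helper names `guess_encoding` and `decode_args`, the use of `locale.getpreferredencoding()` in place of `locale.getdefaultlocale()[1]`, and the final UTF-8 fallback are all my assumptions:

```python
import locale
import sys


def guess_encoding():
    """Pick an encoding for command-line arguments, with fallbacks."""
    return (getattr(sys.stdout, "encoding", None)
            or locale.getpreferredencoding(False)
            or "utf-8")  # last resort; an assumption, not a guarantee


def decode_args(args, encoding=None):
    """Decode every bytestring in args; leave Unicode strings untouched."""
    encoding = encoding or guess_encoding()
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in args]


# Typical use at the top of a script:
#     sys.argv = decode_args(sys.argv)
```

On Python 3 the arguments are already Unicode strings, so `decode_args` returns them unchanged.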

sherpya
3

try either:

f = sys.argv[1].decode('utf-8')

or:

f = unicode(sys.argv[1], 'utf-8')
DmitrySandalov
3

In Python 2, command-line parameters are passed to Python as byte strings, encoded with whatever encoding the shell that started Python uses. So the only way to get command-line parameters as unicode strings is to convert them yourself inside your application.

  • 3
    You don't have to convert the parameters *yourself*. Python could use a wide API if OS provides it. `sys.argv` is a Unicode string in Python3. – jfs Feb 25 '11 at 05:44
  • 1
    @J.F Sebastian +1 for using Python 3! In the version of Python @vy32 is using the arg does have to be converted either at the shell as I put in my answer below or in code or upgrade to python 3! – mkelley33 Feb 25 '11 at 14:48
  • @mkelley33: I meant to say: you don't have to convert parameters yourself *in principle*. It is just a deficiency of the CPython2 implementation. Python3 is just an example showing that software *can* do it for you. Python3 won't be ready for wide adoption for at least several years. I did not say that you should use it. – jfs Feb 27 '11 at 21:01
  • @J.F. Sebastian: Understood my friend, but of course you should use it (Python3)! I would have to examine what dependencies are demanded by a project as well as which limitations might adversely affect any development in the version of Python under consideration. Though a number of packages are still unavailable for Python3, I wouldn't suggest that @pynator wait years to use it unless the project in question exhibits some measure of complexity that *might* require dependencies not yet ready for Python 3 :) Cheers! – mkelley33 Feb 28 '11 at 00:43
2
  1. sys.argv is never "in Unicode"; it's encoded for sure, but Unicode is not an encoding, rather it is a set of code points (numbers), where each number uniquely represents a character. http://www.unicode.org/standard/WhatIsUnicode.html

  2. Go to Terminal.app > Terminal > Preferences > Settings > Character encoding, and select UTF-8 from the drop-down list.

  3. Also, the default Python that ships with Mac OS X has one flaw with regard to Unicode: it's built using the deprecated UCS-2 by default; see: http://webamused.wordpress.com/2011/01/31/building-64-bit-python-python-org-using-ucs-4-on-mac-os-x-10-6-6-snow-leopard/
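
To check which kind of build you are running, look at `sys.maxunicode`: narrow (UCS-2) builds report 0xFFFF and wide (UCS-4) builds report 0x10FFFF. A small sketch follows; the function name `unicode_build` is hypothetical. On Python 3.3 and later the distinction no longer exists and the value is always 0x10FFFF:

```python
import sys


def unicode_build():
    """Describe the interpreter's Unicode build based on sys.maxunicode."""
    if sys.maxunicode == 0xFFFF:
        return "narrow (UCS-2)"
    return "wide (UCS-4)"


print(unicode_build())
```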

mkelley33
  • To test out #2, go to System Preferences > Language & Text > Input Sources and check Unicode Hex Input. Open an interactive interpreter session, and now type alt (option) + 00a9. If you see the © copyright symbol, then your Terminal input is UTF-8 encoded, but you may still need to build Python using the UCS-4 option. – mkelley33 Feb 25 '11 at 05:09
  • 1
    It turns out that Python considers UTF-8 to be ASCII, not Unicode. Gosh, I find this confusing. – vy32 Feb 25 '11 at 22:23
  • 1
    note: "a unique number for every character" is not equivalent to each "number uniquely represents a character". There is a subtle difference. The latter quote is not entirely correct. A code point always points to the same character, but a character can be represented in more than one way using code points. For example, `U+00E9` → `é` **and** `U+0065 U+0301` → `é` i.e., a code point doesn't uniquely represent a character, so there is Unicode normalization http://www.unicode.org/reports/tr15/ to avoid ambiguity in binary representations of Unicode strings. – jfs Feb 26 '11 at 00:30
  • While it is true that Unicode is not an encoding, Unicode *does* include encodings (most notably UTF-8). – Eric O. Lebigot Apr 12 '15 at 08:47
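
The normalization point raised in the comments above can be illustrated with the standard `unicodedata` module. This is only a sketch using the two spellings of `é` mentioned there; the variable names are mine:

```python
import unicodedata

composed = u"\u00e9"     # U+00E9, a single precomposed code point
decomposed = u"e\u0301"  # U+0065 followed by U+0301 (combining acute accent)

# The two strings render the same but compare unequal code point by code point.
equal_raw = composed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```

Here `equal_raw` is `False` and `equal_nfc` is `True`, which is why filenames that look identical can still fail to match unless both sides are normalized first.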