All my scripts use Unicode literals throughout, with

from __future__ import unicode_literals

but this creates a problem when there is the potential for functions being called with bytestrings, and I'm wondering what the best approach is for handling this and producing clear helpful errors.

I gather that one common approach, which I've adopted, is to simply make this clear when it occurs, with something like

def my_func(somearg):
    """The 'somearg' argument must be Unicode."""
    if not isinstance(somearg, unicode):
        raise TypeError("Parameter 'somearg' must be Unicode")
    # ...

for all arguments that need to be Unicode (and might be bytestrings). However even if I do this, I encounter problems with my argparse command line script if supplied parameters correspond to such arguments, and I wonder what the best approach here is. It seems that I can simply check the encoding of such arguments, and decode them using that encoding, with, for example

if __name__ == '__main__':
    parser = argparse.ArgumentParser(...)
    parser.add_argument('somearg', ...)
    # ...

    args = parser.parse_args()
    some_arg = args.somearg
    if not isinstance(some_arg, unicode):
        some_arg = some_arg.decode(sys.getfilesystemencoding())

    #...
    my_func(some_arg, ...)

Is this combination of approaches a common design pattern for Unicode modules that may receive bytestring inputs? Specifically,

  • can I reliably decode command line arguments in this way, and
  • will sys.getfilesystemencoding() give me the correct encoding for command line arguments; or
  • does argparse provide some builtin facility for accomplishing this that I've missed?
orome
  • the `unicode_literals` import has nothing to do with the character encoding used for command-line arguments. – jfs Nov 21 '15 at 08:22
  • @J.F.Sebastian: How so? Using `unicode_literals` means that my code uses Unicode literals, so that any command line strings will get decoded. That's why I need to know the encoding; otherwise I'll get exceptions. – orome Nov 21 '15 at 13:39
  • Command-line is not part of your Python code. Do you understand the word "literal"? e.g., `some_python_name` is not a string literal whatever type `some_python_name` has. `"abc"` in Python source is a string literal (without `unicode_literals` it is a bytestring on Python 2). `sys.argv[i]` is not a literal: its value does not change whether you use `unicode_literals` or not (`print sys.argv` and see for yourself). – jfs Nov 21 '15 at 14:17
  • @J.F.Sebastian: I think you don't understand the question. – orome Nov 21 '15 at 14:20
  • I am investigating this further as there seems to be conflicting references about it. There are also [bugs](http://bugs.python.org/issue2128), so you might want to mention your platform / operating system. – wim Nov 21 '15 at 19:53
  • @raxacoricofallapatorius btw, JF Seb is correct that the future import has absolutely nothing to do with command line arguments. It only affects *literals*, i.e. strings written into the source code in quotes. Given the example code in your question, it seems you don't understand what a literal is. `"These" 'are' """all""" b'examples' u'of' r'string literals'` and the future import means you don't need the u prefix to declare a literal to be a unicode string instead of a bytestring. – wim Nov 21 '15 at 20:40
  • @wim: Yes, and that's why it has everything "to do with" command line arguments. Having `unicode_literals` means command line args will get implicitly decoded by innocent operations (e.g. by simply `arg + 'a'`), which is the whole reason for taking control of the decoding right away and figuring out what their encoding is. – orome Nov 21 '15 at 21:52
  • OK, that's right, but in your code you have `The 'somearg' argument must be a Unicode literal` and it doesn't make any sense. You mean to say it should be a unicode object, not a unicode literal. – wim Nov 21 '15 at 21:58
  • @wim: Ah yes, sloppy. The only reason it made sense in context was that in the case in point (test messages to me) it was just that: a string written in source in quotes. – orome Nov 21 '15 at 22:05

2 Answers

1

I don't think sys.getfilesystemencoding() will necessarily give the right encoding for the shell: that depends on the shell, and can be customised by the shell independently of the filesystem. The file system encoding is only concerned with how non-ASCII filenames are stored.

Instead, you should probably be looking at sys.stdin.encoding which will give you the encoding for standard input.
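Before committing to one source, it can help to see what Python reports on your own system. The following is only a rough sketch: `candidate_encodings` is a hypothetical helper name, and the list it returns shows candidates to compare rather than a ranking of which one is right for command-line arguments.

```python
import locale
import sys


def candidate_encodings():
    """Return the encodings Python reports that might apply to argv bytes.

    Hypothetical helper for inspection only; it does not decide which
    encoding is the correct one.
    """
    candidates = [
        # On Python 2 this can be None when stdin is redirected.
        getattr(sys.stdin, 'encoding', None),
        locale.getpreferredencoding(False),
        sys.getfilesystemencoding(),
    ]
    # Drop missing entries so callers only see usable names.
    return [name for name in candidates if name]
```

Printing `candidate_encodings()` under different `LANG` or terminal settings shows quickly whether these values agree on a given machine.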

Additionally, you might consider using the type keyword argument when you add an argument:

import sys
import argparse as ap

def foo(str_, encoding=sys.stdin.encoding):
    return str_.decode(encoding)

parser = ap.ArgumentParser()
parser.add_argument('my_int', type=int)
parser.add_argument('my_arg', type=foo)
args = parser.parse_args()

print repr(args)

Demo:

$ python spam.py abc hello
usage: spam.py [-h] my_int my_arg
spam.py: error: argument my_int: invalid int value: 'abc'
$ python spam.py 123 hello
Namespace(my_arg=u'hello', my_int=123)
$ python spam.py 123 ollǝɥ
Namespace(my_arg=u'oll\u01dd\u0265', my_int=123)
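A slightly more defensive variant of the `type` callable might look like the sketch below. The names `unicode_arg` and `build_parser` are made up for illustration, and the fallback to `sys.getfilesystemencoding()` when stdin reports no encoding is an assumption, not something argparse does for you. It passes text through untouched and reports undecodable bytes as a normal argparse usage error.

```python
import argparse
import sys


def unicode_arg(value, encoding=None):
    """Hypothetical argparse ``type=`` callable that always returns text."""
    if not isinstance(value, bytes):
        # Already text (always the case on Python 3): nothing to decode.
        return value
    # Assumption: fall back to the filesystem encoding if stdin has none.
    encoding = (encoding
                or getattr(sys.stdin, 'encoding', None)
                or sys.getfilesystemencoding())
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        # argparse turns this into a "usage: ... error: ..." message.
        raise argparse.ArgumentTypeError(
            'cannot decode %r using %s' % (value, encoding))


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('my_arg', type=unicode_arg)
    return parser
```

Calling `build_parser().parse_args()` in a script then yields `args.my_arg` as text, or exits with a usage error if the bytes cannot be decoded.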

If you have to work with non-ASCII data a lot, I would highly recommend upgrading to Python 3. Everything is a lot easier there; for example, parsed arguments will already be unicode on Python 3.


Since there is conflicting information about the command line argument encoding around, I decided to test it by changing my shell encoding to latin-1 whilst leaving the file system encoding as utf-8. For my tests I use the c-cedilla character which has a different encoding in these two:

>>> u'Ç'.encode('ISO8859-1')
'\xc7'
>>> u'Ç'.encode('utf-8')
'\xc3\x87'

Now I create an example script:

#!/usr/bin/python2.7
import argparse as ap
import sys

print 'sys.stdin.encoding is ', sys.stdin.encoding
print 'sys.getfilesystemencoding() is', sys.getfilesystemencoding()

def encoded(s):
    print 'encoded', repr(s)
    return s

def decoded_filesystemencoding(s):
    try:
        s = s.decode(sys.getfilesystemencoding())
    except UnicodeDecodeError:
        s = 'failed!'
    return s

def decoded_stdinputencoding(s):
    try:
        s = s.decode(sys.stdin.encoding)
    except UnicodeDecodeError:
        s = 'failed!'
    return s

parser = ap.ArgumentParser()
parser.add_argument('first', type=encoded)
parser.add_argument('second', type=decoded_filesystemencoding)
parser.add_argument('third', type=decoded_stdinputencoding)
args = parser.parse_args()

print repr(args)

Then I change my shell encoding to ISO/IEC 8859-1:

[Screenshot: terminal encoding setting]

And I call the script:

wim-macbook:tmp wim$ ./spam.py Ç Ç Ç
sys.stdin.encoding is  ISO8859-1
sys.getfilesystemencoding() is utf-8
encoded '\xc7'
Namespace(first='\xc7', second='failed!', third=u'\xc7')

As you can see, the command line arguments were encoded in latin-1, so the second command line argument (using sys.getfilesystemencoding()) fails to decode. The third command line argument (using sys.stdin.encoding) decodes correctly.
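The failure can be reproduced without a shell at all. This small sketch (the helper name `try_decode` is just for illustration) decodes the same single byte with both encodings:

```python
# The single byte a latin-1 terminal sends for the c-cedilla character.
raw = u'\xc7'.encode('latin-1')


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xC7 on its own is an incomplete UTF-8 sequence, so this gives None.
as_utf8 = try_decode(raw, 'utf-8')
# latin-1 maps every byte to a code point, so this gives u'\xc7'.
as_latin1 = try_decode(raw, 'latin-1')
```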

wim
  • That's clever. Can you say a bit more about how `foo` is working? There's some magic going on there. – orome Nov 19 '15 at 19:42
  • It's not very magical, the keyword argument `type` should simply be a callable which returns the converted-to-python object from the incoming bytes. It should raise an `argparse.ArgumentTypeError` if the conversion fails. – wim Nov 19 '15 at 19:47
  • Can it ever happen that the command line supplies Unicode (so that `decode` should not be called at all)? – orome Nov 19 '15 at 20:15
  • I don't know, but I don't think so. You could always put the `isinstance` check into your callable if you are worried about that. Or follow unutbu advice [here](http://stackoverflow.com/q/3857763/674039). Using `unicode(text, 'utf-8')` will fail hard with `TypeError: decoding Unicode is not supported` if text is already a unicode object. Using `.decode`, however, will do weird implicit conversion stuff (something to the effect of `str_.encode(sys.getdefaultencoding()).decode(encoding)`) – wim Nov 19 '15 at 21:16
  • `sys.stdin.encoding` is not the correct encoding for command-line arguments. `sys.getfilesystemencoding()` is the correct encoding. – jfs Nov 21 '15 at 09:29
  • @J.F.Sebastian: Can you document that? There's clearly confusion about that point on the Web. – orome Nov 21 '15 at 13:40
  • @J.F.Sebastian After some quick tests, I believe you are mistaken. I'll edit my post with an example. – wim Nov 21 '15 at 20:04
  • Set PYTHONIOENCODING and see what happens. Run python3 where `sys.argv` contains Unicode strings and see that it uses `sys.getfilesystemencoding()` (utf-8 on OS X, not latin-1). – jfs Nov 21 '15 at 20:28
  • @J.F.Sebastian Why? AFAIK the OP is using python2.7 and does not have that environment variable set. – wim Nov 21 '15 at 20:34
  • 1. Set the envvar, to see that `sys.stdin.encoding` and command-line arguments encodings are different 2. Do you think that `sys.getfilesystemencoding()` has a different purpose on Python 3? What part is wrong in your opinion: (a) OS data: filenames, environment variables, command-line arguments use the same encoding (from Python point of view) (b) that encoding is `sys.getfilesystemencoding()`? – jfs Nov 21 '15 at 20:42
  • It would seem the counterexample I have posted demonstrates (b) to be wrong. – wim Nov 21 '15 at 20:46
  • @wim: it does not show that `sys.stdin.encoding` is used. It only shows that you can't always decode data using `sys.getfilesystemencoding()`. Python 2 passes bytes as is on POSIX and command-line argument can be an arbitrary sequence of bytes except zero (read PEP 383 for more details). See examples in [my answer](http://stackoverflow.com/a/33841721/4279). Use `@` syntax if you want me to be notified about your comments. – jfs Nov 22 '15 at 04:39
  • I seem not to be the only person who this confuses (try `yolk -M crypto-enigma`; there's a `β` [on the PyPi page](https://pypi.python.org/pypi/crypto-enigma)). – orome Nov 22 '15 at 20:41
0

sys.getfilesystemencoding() is the correct (but see the examples below) encoding for OS data such as filenames, environment variables, and command-line arguments.

You could see the logic behind the choice: sys.argv[0] may be the path to the script (the filename) and therefore it is natural to assume that it uses the same encoding as other filenames and that other items in the argv list use the same character encoding as sys.argv[0]. os.environ['PATH'] contains paths and therefore it is also natural that environment variables use the same encoding:

$ echo 'import sys; print(sys.argv)' >print_argv.py
$ python print_argv.py
['print_argv.py']

Note: sys.argv[0] is the script filename whatever other command-line arguments you might have.

The "best way" depends on your specific use case. For example, on Windows you should probably use the Unicode API directly (CommandLineToArgvW()). On POSIX, if all you need is to pass some argv items back to OS functions (such as os.listdir()), then you could leave them as bytes: a command-line argument can be an arbitrary byte sequence, see PEP 383 -- Non-decodable Bytes in System Character Interfaces:

import os, sys

os.execl(sys.executable, sys.executable, '-c', 'import sys; print(sys.argv)',
         bytes(bytearray(range(1, 0x100))))

As you can see, POSIX allows passing any bytes (except zero).

Obviously, you can also misconfigure your environment:

$ LANG=C PYTHONIOENCODING=latin-1 python -c'import sys;
>   print(sys.argv, sys.stdin.encoding, sys.getfilesystemencoding())' €
(['-c', '\xe2\x82\xac'], 'latin-1', 'ANSI_X3.4-1968') # Linux output

The output shows that € is encoded using utf-8, but both the locale and PYTHONIOENCODING are configured differently.
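To make the mismatch concrete, here is a small sketch that decodes the bytes from that output with the three encodings involved (ANSI_X3.4-1968 is the formal name of ASCII). The helper `decode_or_none` is a made-up name for illustration:

```python
# Bytes seen in sys.argv in the misconfigured example.
raw = b'\xe2\x82\xac'


def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {
    'utf-8': decode_or_none(raw, 'utf-8'),      # the euro sign, U+20AC
    'latin-1': decode_or_none(raw, 'latin-1'),  # three unrelated characters
    'ascii': decode_or_none(raw, 'ascii'),      # None: bytes are above 0x7F
}
```

Only utf-8 recovers the character that was typed; latin-1 decodes without error but to the wrong text, and ASCII fails outright.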

The examples demonstrate that sys.argv may be encoded using a character encoding that does not correspond to any of the standard encodings or it even may contain arbitrary (except zero byte) binary data on POSIX (no character encoding). On Windows, I guess, you could paste a Unicode string that can't be encoded using ANSI or OEM Windows encodings but you might get the correct value using Unicode API anyway (Python 2 probably drops data here).

Python 3 uses a Unicode sys.argv and therefore shouldn't lose data on Windows (the Unicode API is used). It also makes it possible to demonstrate that sys.getfilesystemencoding() (not sys.stdin.encoding) is used to decode sys.argv on Linux (where sys.getfilesystemencoding() is derived from the locale):

$ LANG=C.UTF-8 PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\xb5'
$ LANG=C PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\udcc2\udcb5'
$ LANG=en_US.ISO-8859-15 PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\xc2\xb5'

The output shows that LANG, which defines the locale here and therefore sys.getfilesystemencoding() on Linux, is what is used to decode the command-line arguments:

$ python3
>>> print(ascii(b'\xc2\xb5'.decode('utf-8')))
'\xb5'
>>> print(ascii(b'\xc2\xb5'.decode('ascii', 'surrogateescape')))
'\udcc2\udcb5'
>>> print(ascii(b'\xc2\xb5'.decode('iso-8859-15')))
'\xc2\xb5'
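Because Python 3 decodes with the surrogateescape error handler, the original bytes can be recovered and decoded again with a different encoding. A minimal Python 3 sketch follows; `argv_item_to_bytes` is a hypothetical helper, and the encoding is passed explicitly here so the result does not depend on the current locale (in a real script on POSIX, `os.fsencode()` performs this step using the filesystem encoding).

```python
def argv_item_to_bytes(item, encoding):
    """Undo Python 3's decoding of one argv item (hypothetical helper)."""
    return item.encode(encoding, 'surrogateescape')


# The LANG=C case: each undecodable byte became a lone surrogate.
raw = argv_item_to_bytes('\udcc2\udcb5', 'ascii')
# Decoding the recovered bytes as UTF-8 gives the micro sign that was typed.
text = raw.decode('utf-8')
```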
jfs
  • My command line arguments aren't file names. My use-case is text entered as an argument to a script. – orome Nov 21 '15 at 13:42
  • @raxacoricofallapatorius: and? – jfs Nov 21 '15 at 13:44
  • See other comments: Can you document that `sys.getfilesystemencoding()` and not `sys.stdin.encoding` is the correct encoding for command line arguments (given that they will not be file names)? – orome Nov 21 '15 at 13:46
  • Read the first paragraph of the answer. – jfs Nov 21 '15 at 13:48
  • Sorry, my browser must not be showing the link that's there. – orome Nov 21 '15 at 13:49
  • @raxacoricofallapatorius: does "document" mean "provide a reference"? If it does then for example, read the pep 383 – jfs Nov 21 '15 at 13:54
  • I'm stuck, because it appears there's *no way* to ensure that an encoding will work reliably. Perhaps it has something to do with the details of [my code](https://github.com/orome/crypto-enigma-py); but no matter which answer I use, I get errors with most encodings. – orome Nov 28 '15 at 15:05
  • @raxacoricofallapatorius: if you have a specific issue then [create minimal but complete code example that demonstrates it (like one-liners in my answer)](http://stackoverflow.com/help/mcve). Provide an example input/expected output and what you get instead (i.e., describe what do you expect to happen and what happens instead). Mention your OS, locale settings. Then update your question with this info or ask new one. – jfs Nov 28 '15 at 16:16