I'm using the optparse module to retrieve a string value. It only supports str-typed strings, not unicode ones.

So let's say I start my script with:

./someScript --some-option ééééé

Because French characters such as 'é' arrive typed as str, they trigger UnicodeDecodeErrors when read in the code:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 99: ordinal not in range(128)

I played around a bit with the unicode built-in function, but either I get an error, or the character disappears:

>>> unicode('é');
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> unicode('é', errors='ignore');
u''

Is there anything I can do to use OptParse to retrieve unicode/utf-8 strings?

It seems the string can be retrieved and printed fine. The error occurs when I pass that string to SQLite (using the APSW module): cursor.execute("...") apparently tries to convert it to unicode.

Here is a sample program that causes the error:

#!/usr/bin/python
# coding: utf-8

import os, sys, optparse
parser = optparse.OptionParser()
parser.add_option("--some-option")
(opts, args) = parser.parse_args()
print unicode(opts.some_option)
VLAZ
  • Str objects are just byte stores so if the input is UTF-8, the string will hold the UTF-8 value. Where is the unicode error being thrown? – Alastair McCormack Oct 29 '12 at 12:55
  • I've just tested this on a UTF-8 console and optparse works fine and returns the character to the console. Can you clarify if this error is in your code or in the optparse? – Alastair McCormack Oct 29 '12 at 13:00
  • Does your program depend on optparse, or are you building from scratch? In the latter case I would recommend the docopts package instead of optparse. You will be really surprised how easily it parses the CLI arguments. – Bruce Oct 29 '12 at 13:02
  • @Fuzzyfelt: I've narrowed my question a bit thanks to your second comment. –  Oct 29 '12 at 13:09

4 Answers

You could decode the arguments before the parser handles them. Taking your example:

#!/usr/bin/python
# coding: utf-8
import os, sys, optparse
parser = optparse.OptionParser()
parser.add_option("--some-option")

# Decode the command line arguments to unicode
for i, a in enumerate(sys.argv):
    sys.argv[i] = a.decode('ISO-8859-15')

(opts, args) = parser.parse_args()
print type(opts.some_option), opts.some_option

This gives the following output:

C:\workspace>python file.py --some-option préférer
<type 'unicode'> préférer

I've chosen the ISO/IEC 8859-15 code page, as it seems the most appropriate for you. Adapt it if needed.
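
The choice of codec matters because the same bytes decode differently under different encodings. The byte 0xc3 in the question's traceback suggests the terminal there is sending UTF-8, in which case ISO-8859-15 would decode without raising an error but would produce the wrong characters. A minimal sketch of that mismatch, assuming the input is the UTF-8 encoding of 'é' (written with escapes so it does not depend on the source file encoding):

```python
# -*- coding: utf-8 -*-
# Assumption: the terminal sent 'é' as UTF-8, i.e. the two bytes 0xC3 0xA9.
raw = b'\xc3\xa9'

as_utf8 = raw.decode('utf-8')          # one character: U+00E9
as_latin9 = raw.decode('ISO-8859-15')  # two characters: U+00C3 U+00A9

print(repr(as_utf8))
print(repr(as_latin9))
```

If the decoded value comes out twice as long as expected for accented text, the codec is probably the wrong one for the console.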

jro
  • To avoid hardcoding the encoding, you could try to guess it with ``locale.getpreferredencoding()`` (import ``locale``). – Stan Nov 17 '14 at 08:45
  • By the way, you may need to make it unicode with ``unicode(a.decode("your_encoding_here"))``. – Stan Nov 17 '14 at 08:53
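
Following the comment about ``locale.getpreferredencoding()``, a sketch of decoding the arguments without hardcoding the code page might look like this. `decode_args` is a hypothetical helper, not part of optparse. It hands the decoded list to `parse_args()` rather than mutating `sys.argv`, and it leaves already-decoded text untouched. The argument list below is simulated, with UTF-8 assumed, so that the example is reproducible.

```python
import locale
import optparse


def decode_args(argv, encoding=None):
    """Hypothetical helper: decode byte-string arguments to unicode text."""
    if encoding is None:
        # Assumption: the shell uses the locale's preferred encoding.
        encoding = locale.getpreferredencoding() or 'utf-8'
    # On Python 2, str is bytes, so every command line argument is decoded.
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in argv]


parser = optparse.OptionParser()
parser.add_option("--some-option")

# Simulated command line; the bytes are the UTF-8 encoding of "préférer".
fake_argv = [b'--some-option', b'pr\xc3\xa9f\xc3\xa9rer']
opts, args = parser.parse_args(decode_args(fake_argv, 'utf-8'))
print(repr(opts.some_option))
```

In a real script, the call would be `parser.parse_args(decode_args(sys.argv[1:]))`, letting the locale decide the encoding.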

Input is returned in the console encoding, so based on your updated example, use:

print opts.some_option.decode(sys.stdin.encoding)

unicode(opts.some_option) defaults to using ascii as the encoding.
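
To illustrate: the implicit ASCII decode fails on any byte above 0x7f, while an explicit decode with the right codec succeeds. The sketch below assumes UTF-8 input. It also adds a fallback, because `sys.stdin.encoding` can be `None` when stdin is not a terminal (for example, when input is piped). The `to_text` helper name and the UTF-8 fallback are assumptions made for illustration.

```python
import sys


def to_text(value, encoding=None):
    """Hypothetical helper: decode a byte string using the console encoding."""
    if encoding is None:
        # sys.stdin.encoding may be None when stdin is redirected;
        # falling back to UTF-8 is an assumption, adjust for your platform.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return value.decode(encoding)


raw = b'\xc3\xa9'  # assumed UTF-8 bytes for U+00E9

try:
    raw.decode('ascii')  # the decode that unicode(raw) performs on Python 2
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

text = to_text(raw, 'utf-8')
print(ascii_ok, repr(text))
```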

Mark Tolonen

I believe your error is related to the following:

For example, to write Unicode literals including the Euro currency symbol, the ISO-8859-15 encoding can be used, with the Euro symbol having the ordinal value 164. This script will print the value 8364 (the Unicode codepoint corresponding to the Euro symbol) and then exit:

# -*- coding: iso-8859-15 -*-

currency = u"€"
print ord(currency)
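
The same mapping can be checked at runtime, without relying on the source file's encoding declaration. A small sketch: byte 0xA4 (decimal 164) decoded as ISO-8859-15 yields code point 8364, whereas under ISO-8859-1 the same byte is the generic currency sign, U+00A4.

```python
euro_byte = b'\xa4'  # ordinal 164

latin9 = euro_byte.decode('iso-8859-15')  # Euro sign, U+20AC
latin1 = euro_byte.decode('iso-8859-1')   # currency sign, U+00A4

print(ord(latin9))
print(ord(latin1))
```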
SilentGhost
Woot4Moo
#!/usr/bin/python
# coding: utf-8

import os, sys, optparse

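# Hack: site.py deletes sys.setdefaultencoding() at startup; reload(sys) restores it.
# This changes the implicit str <-> unicode conversion for the whole process.
reload(sys)
sys.setdefaultencoding('utf-8')

parser = optparse.OptionParser()
parser.add_option(u"--some-option")
(opts, args) = parser.parse_args()
# With the default encoding set to UTF-8, the implicit decode of UTF-8 input succeeds
print unicode(opts.some_option)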