
I'm a newbie, and I'm sure a similar question has been asked in the past, but I am having trouble finding/understanding an answer. Thank you in advance for being patient with me!

So I'm trying to write a script to read lines in a utf-8 encoded input file, compare portions of it to an optional command line argument passed in by the user, and if there's a match, to do some stuff to that line before printing it to an output file. I'm using codecs to open the files.

I'm using the argparse module to parse command line arguments right now. The lines in the file can be in all sorts of languages, hence the command line argument needs to also be utf-8.

For example:

A line from the file might look like this:

разъедают {. r ax z . j je . d ax1 . ju t .}

The script should be called from the command line with something like this:

>python myscript.py mytextfile.txt -grapheme ъ

Here's the part of my code that is supposed to do the processing. In this case, orth is some Cyrillic text and grapheme is a Cyrillic character.

def process_orth(orth, grapheme):
    grapheme = grapheme.decode(sys.stdin.encoding).encode('utf-8')
    if (grapheme in orth):
        print 'success, your grapheme was: ' + grapheme.encode('utf-8')
        return True
    else:
        print 'failure, your grapheme was: ' + grapheme.encode('utf-8')
        return False

Unfortunately, even though the grapheme is definitely there, the function returns false and prints a question mark instead of the grapheme:

failure, your grapheme was: ?

I've tried adding the following at the start of process_orth() as per the recommendation of some other post I read, but it didn't seem to work:

grapheme.decode(sys.stdin.encoding).encode('utf-8')

So my question is...

How do I pass utf-8 strings through the command line into a python script? Also, are there any extra quirks with this on Windows7 (and does having cygwin installed change anything)?

Martijn Pieters
KaleidoEscape
  • `print repr(orth)` gives me `u'\u0440\u0430\u0437\u044a\u0435\u0434\u0430\u044e\u0442'`, and `print repr(grapheme)` gives me `'?'` – KaleidoEscape May 24 '13 at 23:38

1 Answer


If you are opening the input file using codecs.open() then you have unicode data, not encoded data. You would want to just decode grapheme, not encode it again to UTF-8:

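import sys

def process_orth(orth, grapheme):
    # orth is already unicode (read via codecs.open), so only decode the argument
    grapheme = grapheme.decode(sys.stdin.encoding)
    if grapheme in orth:
        print u'success, your grapheme was: ' + grapheme
        return True
    return False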

Note that we print unicode as well; normally print will ensure that Unicode values are encoded again for your current codepage. This can still fail, as Windows console printing is notoriously difficult; see http://wiki.python.org/moin/PrintFails.
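To illustrate why decoding alone is enough, here is a minimal sketch. The helper name `contains_grapheme` and its `encoding` parameter are made up for this example, and the argument bytes are built by hand rather than read from a real command line. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def contains_grapheme(orth, raw_grapheme, encoding):
    """Return True if the byte-string argument, once decoded, occurs in orth.

    orth         -- text that is already unicode (e.g. a line read via codecs.open)
    raw_grapheme -- bytes, as an argument might arrive on the command line
    encoding     -- codec of raw_grapheme (hypothetical parameter for this sketch)
    """
    grapheme = raw_grapheme.decode(encoding)  # decode only; do not re-encode
    return grapheme in orth


orth = u'\u0440\u0430\u0437\u044a\u0435\u0434\u0430\u044e\u0442'  # разъедают
raw = u'\u044a'.encode('utf-8')  # ъ as UTF-8 bytes (assumed input for the demo)

# Unicode compared with unicode: the grapheme is found.
print(contains_grapheme(orth, raw, 'utf-8'))
```

The point is that both sides of the `in` test are unicode. Re-encoding the grapheme to UTF-8 would put a byte string on one side and unicode on the other.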

Unfortunately, sys.argv on Windows can apparently end up garbled, as Python uses a non-unicode aware system call. See "Read Unicode characters from command-line arguments in Python 2.x on Windows" (http://stackoverflow.com/q/846850) for a unicode-aware alternative.
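As a rough analogy for how a character can turn into a question mark, the sketch below pushes ъ through a few codecs using Python's own `'replace'` error handler. This is not the conversion Windows actually performs (that happens before Python sees the argument and may follow different rules); the helper name `lossy_roundtrip` is made up for this example.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def lossy_roundtrip(text, codec):
    """Encode text to codec, substituting '?' for unmappable characters, then decode."""
    return text.encode(codec, 'replace').decode(codec)


grapheme = u'\u044a'  # ъ

# cp437 has no Cyrillic letters, so the character cannot survive the trip.
for codec in ('cp437', 'cp1251', 'utf-8'):
    print(codec, repr(lossy_roundtrip(grapheme, codec)))
```

Once the character has been replaced this way, no amount of decoding on the Python side can bring it back, which is why the argument has to be fetched in a unicode-aware way in the first place.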

I see no reason for argparse to have any problems with Unicode input, but if it does, you can always take the unicode output from win32_unicode_argv() and encode it to UTF-8 before passing it to argparse.
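A minimal sketch of that idea follows, under some assumptions. `unicode_argv` is a name made up here, modelled on the approach in the linked question: it calls the Win32 functions `GetCommandLineW` and `CommandLineToArgvW` through ctypes, and simply returns `sys.argv` on other platforms. The `-grapheme` option mirrors the question; the parser setup is a guess at what the script looks like. The sketch hands the unicode list straight to argparse rather than encoding it to UTF-8 first, on the assumption that argparse copes with unicode arguments.

```python
from __future__ import print_function

import argparse
import sys


def unicode_argv():
    """Return the script's arguments as unicode strings (hypothetical helper).

    On Windows, ask the Win32 API for the wide-character command line;
    elsewhere, fall back to sys.argv unchanged.
    """
    if sys.platform != 'win32':
        return list(sys.argv)

    # Imported here because windll only exists on Windows.
    from ctypes import POINTER, byref, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    GetCommandLineW = windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    argc = c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
    if not argv or argc.value < len(sys.argv):
        return list(sys.argv)  # the call failed or looks inconsistent; give up

    # Skip the interpreter and its own options so the list lines up with sys.argv.
    # (The buffer is not freed here, which is tolerable in a short-lived script.)
    start = argc.value - len(sys.argv)
    return [argv[i] for i in range(start, argc.value)]


def build_parser():
    """Parser resembling the one in the question (assumed layout)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('infile')
    parser.add_argument('-grapheme', default=None)
    return parser


# Demo with an explicit unicode argument list instead of the real command line.
demo_args = [u'mytextfile.txt', u'-grapheme', u'\u044a']
args = build_parser().parse_args(demo_args)
print(repr(args.grapheme))
```

In the real script you would call `build_parser().parse_args(unicode_argv()[1:])` instead of using `demo_args`. The alignment with `sys.argv` is a heuristic and may not hold for every way of launching Python, so treat it as a starting point.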

Community
Martijn Pieters
  • Thanks for your response. I just tried that (just having decode), but it didn't change anything except that `print repr(grapheme)` now gives me `u'?'` – KaleidoEscape May 24 '13 at 23:41
  • @KaleidoEscape: That is an ascii question mark; either that was literally what was passed in or you used `.decode(sys.stdin.encoding, 'replace')` to ignore unknown characters. What does `repr(sys.argv)` tell you Windows passed in? What does `print sys.stdin.encoding` tell you is the expected codec? – Martijn Pieters May 24 '13 at 23:43
  • I didn't use `.decode(sys.stdin.encoding, 'replace')`, at least consciously, heh. I am using the `argparse` module, maybe it does this? `repr(sys.argv)` tells me `['myscript.py', 'test.txt', '-graph', '?']` and `print sys.stdin.encoding` says `cp437` (but of course I did not type a question mark when running the script, I typed ъ). – KaleidoEscape May 24 '13 at 23:50
  • @KaleidoEscape: Then **windows** passed in `'?'` as an argument, I'm afraid. – Martijn Pieters May 24 '13 at 23:53
  • @KaleidoEscape: See [Read Unicode characters from command-line arguments in Python 2.x on Windows](http://stackoverflow.com/q/846850) for a work-around. – Martijn Pieters May 24 '13 at 23:54