I'm a newbie, and I'm sure a similar question has been asked in the past, but I am having trouble finding/understanding an answer. Thank you in advance for being patient with me!
I'm trying to write a script that reads lines from a UTF-8 encoded input file, compares portions of each line to an optional command line argument passed in by the user, and, if there's a match, does some processing on that line before printing it to an output file. I'm using codecs to open the files and the argparse module to parse the command line arguments. The lines in the file can be in all sorts of languages, hence the command line argument also needs to be UTF-8.
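To make the setup concrete, here is a minimal sketch of what is described above, not my actual script. The helper names (to_text, parse_args, matching_lines) are made up for illustration, and falling back to UTF-8 when the console encoding is unknown is an assumption. The idea is to turn the argument into a text string once, so that it is compared against lines that codecs has already decoded.

```python
# -*- coding: utf-8 -*-
import argparse
import codecs
import sys


def to_text(value, encoding):
    """Return value as a text (unicode) string, decoding it if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def parse_args(argv):
    """Parse the input file name and the optional -grapheme argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument('infile')
    parser.add_argument('-grapheme', default=None)
    args = parser.parse_args(argv)
    if args.grapheme is not None:
        # Assumption: the argument arrives in the console's encoding;
        # UTF-8 is only a fallback guess when that encoding is unknown.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
        args.grapheme = to_text(args.grapheme, encoding)
    return args


def matching_lines(path, grapheme):
    """Yield decoded lines from a UTF-8 file that contain grapheme.

    With grapheme set to None every line is yielded.
    """
    with codecs.open(path, 'r', encoding='utf-8') as handle:
        for line in handle:
            if grapheme is None or grapheme in line:
                yield line.rstrip('\n')
```

With this shape, the script body would call parse_args(sys.argv[1:]) and then loop over matching_lines(args.infile, args.grapheme).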
For example:
A line from the file might look like this:
разъедают {. r ax z . j je . d ax1 . ju t .}
The script should be called from the command line with something like this:
>python myscript.py mytextfile.txt -grapheme ъ
Here's the part of my code that is supposed to do the processing. In this case, orth is some Cyrillic text and grapheme is a Cyrillic character.
def process_orth(orth, grapheme):
    grapheme = grapheme.decode(sys.stdin.encoding).encode('utf-8')
    if (grapheme in orth):
        print 'success, your grapheme was: ' + grapheme.encode('utf-8')
        return True
    else:
        print 'failure, your grapheme was: ' + grapheme.encode('utf-8')
        return False
Unfortunately, even though the grapheme is definitely there, the function returns False and prints a question mark instead of the grapheme:
failure, your grapheme was: ?
I've tried adding the following at the start of process_orth(), as per the recommendation of some other post I read, but it didn't seem to work:
grapheme.decode(sys.stdin.encoding).encode('utf-8')
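One possible reading of the question mark, offered as a guess rather than a diagnosis: if the character cannot be represented in the code page used to hand the argument to the script, it may already be replaced by "?" before any decode call runs. The sketch below only simulates that with two code pages; lossy_round_trip and contains_grapheme are hypothetical helpers, and the choice of cp1251 and cp1252 is an assumption, not something read from my machine.

```python
# -*- coding: utf-8 -*-
def lossy_round_trip(text, codepage):
    """Encode text to a legacy code page, replacing characters it cannot
    represent with '?', then decode the bytes back to text."""
    return text.encode(codepage, 'replace').decode(codepage)


def contains_grapheme(orth, grapheme):
    """Substring test on text strings; both arguments are assumed decoded."""
    return grapheme in orth


orth = u'разъедают'
grapheme = u'ъ'

# A Cyrillic code page can represent the character, so it survives.
survived = lossy_round_trip(grapheme, 'cp1251')

# A Western European code page cannot, so the character is replaced.
mangled = lossy_round_trip(grapheme, 'cp1252')

print(contains_grapheme(orth, survived))
print(contains_grapheme(orth, mangled))
```

If this is what happens, no later decode or encode call can recover the original character, because the information is already gone by the time the script sees the argument.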
So my question is...
How do I pass UTF-8 strings through the command line into a Python script? Also, are there any extra quirks with this on Windows 7 (and does having Cygwin installed change anything)?