
I'm using Python's argparse module to process command-line arguments, and I am having a problem decoding Unicode file names and file paths. Here's my code:

import argparse
import os
import sys

def main(options):
    # Python 2: the argument arrives as a byte string, so decode it first.
    detail = options.file.decode(sys.stderr.encoding)
    print os.path.exists(detail)
    print detail

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", dest="file", default="", help="file to be processed")
    options = parser.parse_args()
    main(options)

Now, when I run the script via Windows command line:

sample.py -f "c:\temp\2-¡¢£¤¥¦§¨©ª«¬®¯°±²³´μ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"

I am getting this as the result:

c:\temp\2-íóúñѪº¿⌐¬½¼?«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■ 
False

As you can see, the decoded file name differs from the one I passed in, so the file existence check returns False.

Any ideas on how to solve this? Thanks in advance!

jaysonpryde

2 Answers


To understand what's going on: Python 2 has several bugs related to the command line, encoding, and Windows (e.g. subprocess.call fails with Unicode strings on the command line).

Golden rule for issues like this: Use repr() and print everything as ASCII strings (using Unicode escapes). Otherwise, data may become mangled while printing, adding to the confusion.
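
A minimal sketch of that rule, written so that it runs on both Python 2 and Python 3. The helper name ascii_repr and the sample path are made up for this illustration; the helper relies on the standard unicode_escape codec to write every non-ASCII character (and every backslash) as an escape, so the console has nothing left to mangle:

```python
def ascii_repr(text):
    """Return text with non-ASCII characters and backslashes written as escapes."""
    return text.encode("unicode_escape").decode("ascii")

# Hypothetical sample path; \u00e4 is the letter a-umlaut.
name = u"C:\\temp\\\u00e4.txt"

# Only ASCII characters reach the console here.
print(ascii_repr(name))
```

If the escapes shown match the code points you expect, the string itself is fine and any remaining garbling comes from printing.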

I suggest starting with a simpler file name (C:\temp\ä.txt), for which repr() should show u'C:\\temp\\\xe4.txt' (Python 2 writes ä, U+00E4, as the escape \xe4).

So the first step is to find out what the input is:

print type(options.file)

If that's not unicode, you received a byte string that was never decoded properly. To fix this, decode it with the encoding which Windows used to pass you the file name. Try sys.stdin.encoding and 'mbcs' (= Windows filesystem encoding).
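
To illustrate why the choice of encoding matters, here is a small sketch that decodes the same bytes with two candidate encodings. The helper name decode_with is hypothetical, and cp1252 and cp437 are assumptions (the usual ANSI and console code pages on a Western European Windows installation); 'mbcs' is left out only because that codec exists on Windows alone:

```python
def decode_with(raw, encodings):
    """Decode the same bytes with each candidate; return (encoding, text) pairs."""
    results = []
    for enc in encodings:
        try:
            results.append((enc, raw.decode(enc)))
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding.
            results.append((enc, None))
    return results

# Assumption: the bytes arrive in cp1252. The three characters are
# U+00A1, U+00A2 and U+00A3, the first three of the file name in the question.
raw = u"\u00a1\u00a2\u00a3".encode("cp1252")

for enc, text in decode_with(raw, ["cp437", "cp1252"]):
    if text is None:
        shown = "<undecodable>"
    else:
        shown = text.encode("unicode_escape").decode("ascii")
    print("%s: %s" % (enc, shown))
```

Decoding with cp437 yields U+00ED, U+00F3, U+00FA (í, ó, ú), which is how the mangled name in the question begins. That suggests, without proving it, that the console encoding was applied to bytes that were really in the ANSI code page.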

Print the string using repr() until it looks correct.

PEP 277 explains how Python handles Unicode file names on Windows.

In a nutshell, you must make sure you pass Unicode strings (type() == unicode) and not byte strings (type() == str) to open().

Aaron Digulla
  • for the simple filename (C:\temp\ä.exe), it works. However, for the sample filename i gave, it's not – jaysonpryde Jul 03 '14 at 12:49
  • Unless you tell me why it's not working (error message, commands you tried), I can't help you. – Aaron Digulla Jul 03 '14 at 13:14
  • No error messages. I just used os.path.exists(options.file) for both "C:\temp\ä.exe" and "c:\temp\2-¡¢£¤¥¦§¨©ª«¬®¯°±²³´μ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ .exe". Former results to 'True', while the latter results to 'False' even if the file is existing – jaysonpryde Jul 03 '14 at 13:22
  • How did you create this file? On Windows, it's possible to create files with names that you can't open afterwards. For example, the broken bar (¦) might be illegal in file names. – Aaron Digulla Jul 03 '14 at 14:47
  • Using control panel > Region and Language, you can add languages and keyboard support. Once done, a language bar will be displayed in your taskbar, which will allow you to modify the filenames with such characters – jaysonpryde Jul 03 '14 at 15:03
  • And you changed it inside of Windows Explorer? Or from the command line? Can you open the file in Notepad? – Aaron Digulla Jul 03 '14 at 15:15
  • changed it in Windows Explorer and yes, I can open the file in notepad – jaysonpryde Jul 03 '14 at 15:22
  • I'm running out of ideas; maybe you can try it with Python 3 - at least just to see if the code would work there. – Aaron Digulla Jul 03 '14 at 15:27

START OF UPDATE

Considering hpaul's feedback below, as well as the bug link he pointed out, I was able to resolve the problem by converting the sys.argv[1:] arguments to unicode using this function:

import sys

def win32_unicode_argv():
    """Return the command-line arguments as unicode strings (Windows only)."""
    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR
    # Fetch the full command line as a wide (UTF-16) string.
    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR
    # Split it into individual arguments.
    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)
    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

if __name__ == "__main__":
    # Replace the byte-string arguments before argparse reads them.
    sys.argv = win32_unicode_argv()

Obviously, this only works on Windows, but I don't think it is necessary for scripts running under Linux.
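
One possible way to keep a single script usable on both platforms is to guard the Windows-specific call. This is only a sketch under stated assumptions: the function name unicode_argv and its parameters are hypothetical, windows_getter stands for a function such as win32_unicode_argv, and decoding with sys.getfilesystemencoding() on other platforms is an assumption rather than something tested in this thread:

```python
import sys

def unicode_argv(windows_getter, argv=None, platform=sys.platform):
    """Return the arguments as text, using windows_getter only on Windows."""
    if argv is None:
        argv = sys.argv
    if platform == "win32":
        # Ask the Windows API for the wide-character command line.
        return windows_getter()
    # Elsewhere, decode byte strings (Python 2) and pass text through (Python 3).
    encoding = sys.getfilesystemencoding()
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in argv]
```

A script would then call sys.argv = unicode_argv(win32_unicode_argv) before building the parser.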

END OF UPDATE

As recommended by Aaron, I tried to make sure that the argument was decoded to unicode, so I did this:

parser.add_argument("-f", dest="file", type=lambda s : unicode(s, sys.getfilesystemencoding()), default="", help="file to be processed")

When I print the type, it says unicode:

print type(options.file)
<type 'unicode'>
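
The same conversion can be written as a named function instead of a lambda, which makes it easier to test on its own. A rough sketch, where the name text_argument and the sample argument list are hypothetical, and which passes the value through unchanged on Python 3, where arguments are already text:

```python
import argparse
import sys

def text_argument(value):
    """argparse type= callable: return the argument as text."""
    if isinstance(value, bytes):
        # Python 2 hands over a byte string; decode it with the filesystem encoding.
        return value.decode(sys.getfilesystemencoding())
    return value  # already text

parser = argparse.ArgumentParser()
parser.add_argument("-f", dest="file", type=text_argument, default=u"",
                    help="file to be processed")

# Parse an explicit list instead of sys.argv so the sketch is reproducible.
options = parser.parse_args(["-f", "sample.txt"])
print(type(options.file).__name__)
```

Note that this only changes the type of the value. If the bytes were already wrong when they reached Python, decoding them cannot restore the original file name.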

However, when I did the existence check again, the result was still False. I tried the following:

print os.path.exists(repr(options.file))

This results in False.

print os.path.exists(repr(options.file.decode("utf8")))

This results in:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 25-36: ordinal not in range(128)
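
A short sketch of why the repr() attempts cannot succeed, using a hypothetical sample path: repr() returns the source-code form of a string, including the surrounding quotes and escapes, so it names a different path than the string itself. The error above has a separate cause: in Python 2, calling decode() on a value that is already unicode first encodes it to ASCII implicitly.

```python
# Hypothetical sample path; \u00e4 is the letter a-umlaut.
path = u"C:\\temp\\\u00e4.txt"
literal = repr(path)

print(literal == path)            # False
print(len(literal) - len(path))   # extra characters added by quotes and escapes
```

repr() is useful for printing a value while debugging; the existence check itself should receive options.file unchanged.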
jaysonpryde
  • As I said, you have to use `os.path.exists(options.file)` - all file operations need a path of type `unicode` or they fail. Also, using `unicode.decode()` doesn't make sense - the data is already Unicode. Use `unicode.encode('utf-8')` to convert Unicode to bytes. – Aaron Digulla Jul 03 '14 at 13:12
  • I already used os.path.exists(options.file) for both "C:\temp\ä.exe" and "c:\temp\2-¡¢£¤¥¦§¨©ª«¬®¯°±²³´μ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ .exe". Former results to 'True', while the latter results to 'False' even if the file is existing – jaysonpryde Jul 03 '14 at 13:25