
I'm trying to find a reliable way to scan files on Windows in Python, while allowing for the possibility that there may be various Unicode code points in the filenames. I've seen several proposed solutions to this problem, but none of them work for all of the actual issues that I've encountered in scanning filenames created by real-world software and users.

The code sample below is an attempt to extricate and demonstrate the core issue. It creates three files in a subfolder with the sorts of variations I've encountered, and then attempts to scan through that folder and display each filename followed by the file's contents. It will crash on the attempt to read the third test file, with OSError [Errno 22] Invalid argument.

import os

# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.getcwd() + '\\temp'
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
print('file contents', file=open('temp/simple.txt','w'))
print('file contents', file=open('temp/with a ® symbol.txt','w'))
print('file contents', file=open('temp/with these chars ΣΑΠΦΩ.txt','w'))

# goal is to scan the files in a manner that allows for printing
# the filename as well as opening/reading the file ...
for root,dirs,files in os.walk(tempfolder.encode('UTF-8')):
    for filename in files:
        fullname = os.path.join(tempfolder.encode('UTF-8'), filename)
        print(fullname)
        print(open(fullname,'r').read())

As it says in the code, I just want to be able to display the filenames and open/read the files. Regarding display of the filename, I don't care whether the Unicode characters are rendered correctly for the special cases. I just want to print the filename in a manner that uniquely identifies which file is being processed, and doesn't throw an error for these unusual sorts of filenames.

If you comment out the final line of code, the approach shown here will display all three filenames with no errors. But it won't open the file with miscellaneous Unicode in the name.

Is there a single approach that will reliably display/open all three of these filename variations in Python? I'm hoping there is, and my limited grasp of Unicode subtleties is preventing me from seeing it.

Doug Mahugh
  • where are you running the code from? – Padraic Cunningham Nov 22 '15 at 18:42
  • From a Command Prompt, or from within VS Code, same error in both cases. I need to run it from a Command Prompt when it's done. – Doug Mahugh Nov 22 '15 at 18:48
  • Why the `utf-8` encoding? This should raise type errors on python 3.x. Sure you aren't running it in 2.x? Try 3.x and remove the `.encode('utf-8')` bits. – tdelaney Nov 22 '15 at 19:12
  • @Doug, is running from a cmd prompt a necessity? – Padraic Cunningham Nov 22 '15 at 19:14
  • @tdelaney, decoding would cause an error not encoding. – Padraic Cunningham Nov 22 '15 at 19:16
  • @PadraicCunningham `os.path.join('somedir'.encode('utf-8'), 'somefile')` results in `TypeError: Can't mix strings and bytes in path components`. In python 3.x, OP would be passing a `bytes` object to the file system functions which won't work. He would be getting an entirely different error. The point is that OP shouldn't be encoding the strings. – tdelaney Nov 22 '15 at 19:20
  • @tdelaney, they are both bytes, I would only see an error if you `os.walk(tempfolder): ` using the given example, the root cause is also most likely unrelated – Padraic Cunningham Nov 22 '15 at 19:27
  • @PadraicCunningham I think OP is running this on python 2.x. In 2.x, encoding a string returns another `str`. By the time we get to the `open` call, `fullname` is encoded but when `open` calls down to the operating system, it tries to expand each byte to a wide char (basically encoding it a second time) but one of the extended encoding bytes is an invalid code point so the operation fails with ` OSError [Errno 22] Invalid argument.`. – tdelaney Nov 22 '15 at 19:31
  • @PadraicCunningham `tempfolder = os.getcwd() + '\\temp'` - tempfolder is a string and `os.walk(tempfolder)` is the proper way to do it. – tdelaney Nov 22 '15 at 19:35
  • @tdelaney, `tempfolder.encode('UTF-8')` makes it a bytes object. Anyway, I am more surprised the files get created at all: using code page 850 I get two question marks in the filename, which would give `OSError [Errno 22] Invalid argument`. Unless the cmd shell has the correct encoding, it is going to fail. – Padraic Cunningham Nov 22 '15 at 19:40
  • Appreciate the comments, but it's not clear what I can do to make this work, and I believe I've already tried everything suggested. For example, simply removing the encode('utf-8') from the two places it appears in my sample code just makes it crash in a different way, with a UnicodeEncodeError on file #2 instead of the OSError on file #3. @padraic, I'm open to a solution that doesn't use the command prompt, if it's an approach that can meet my two requirements: displaying the filename (even if not rendered perfectly), and being able to open/read the file. – Doug Mahugh Nov 22 '15 at 19:45
  • @DougMahugh That's a different problem, you generally can't **print** Unicode text on the Windows console. – roeland Nov 22 '15 at 19:47
  • @DougMahugh, using an IDE like PyCharm, or Cygwin, will save you a lot of headaches: the code should run and display the output perfectly. The cmd shell is a pain when it comes to encodings. https://www.cygwin.com/, https://www.jetbrains.com/pycharm/download/ – Padraic Cunningham Nov 22 '15 at 19:48
  • One other question... do you have some sort of encoding marker on the file such as the utf-8 signature ('\xef\xbb\xbf') or `# -*- coding: latin-1 -*-`? I'm not sure how I'm even seeing non-ascii characters in the strings in the first place! [PEP 263](https://www.python.org/dev/peps/pep-0263/) addresses issues with non-ascii encoding in python scripts. – tdelaney Nov 22 '15 at 19:50
  • That's a good question, @tdelaney, and I'm not sure of the answer. I just went into Windows Explorer and copied those characters from the filenames of files that were crashing my script, then pasted them into this sample code in VS Code. The characters show up fine in both VS Code and Notepad, if that provides a clue. – Doug Mahugh Nov 22 '15 at 19:53
  • @DougMahugh Try peeking at the front of the file... `open('myscript.py', 'rb').read(5)` and see if it starts with non-ascii stuff. Microsoft editors like to put binary encoding indicators (BOM etc.) at the front of files. – tdelaney Nov 22 '15 at 19:56
  • As for the output, if all you need is to print something unique, that part isn't an insurmountable problem: I usually just convert anything outside the ASCII range into a hexadecimal representation. I don't know whether Python supports Unicode file names or not. One note: there's another case you aren't testing, files whose names are invalid UTF-16 sequences, cf [my blog post here](https://harryjohnston.wordpress.com/2014/12/11/robocopy-can-silently-fail-to-copy-directories-with-invalid-utf-16-names-or-why-i-always-compare-after-copying/). – Harry Johnston Nov 22 '15 at 19:56
  • @tdelaney, I tried that, and read(5) just returns b'impor' -- i.e., the file seems to just start with the first character of the first line of code. I have no idea what's going on there, but it's not actually relevant to my core issue, which is that I have files with these characters in the names and need to find a way to gracefully deal with them. I agree that it's an extremely bad practice to create such filenames, but across four Windows machines I've scanned, roughly 1 in 20,000 files has this stuff in it. Some of those files were created by commercial software, and none of them by me. – Doug Mahugh Nov 22 '15 at 20:02
  • @DougMahugh, you can spend a few hours trying to get a solution that allows you to use a cmd shell, but you will find that it won't ever work properly, or you can just spend 15 minutes setting up cygwin or an IDE that supports utf-8 and will just work. – Padraic Cunningham Nov 22 '15 at 20:05
  • I hear you on PyCharm/Cygwin, but I'm hoping to find a solution that doesn't require those sorts of dependencies for my script. – Doug Mahugh Nov 22 '15 at 20:53
  • @HarryJohnston Windows supports any Unicode in file names (apart from a few characters with special meaning like `\`, `/` , `*` etc), and Python will correctly handle those, as long as you use unicode strings as file names. – roeland Nov 22 '15 at 21:25
  • @DougMahugh it's not too hard, see http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console . For *quick & dirty testing*, you can use a hack like `"abc\u2012".encode("mbcs", errors="replace").decode("mbcs")` to filter those characters. – roeland Nov 22 '15 at 21:30
  • @roeland: good to know. I suspect you might run into trouble with invalid UTF-16 sequences (e.g., an unpaired surrogate) since presumably Python is internally storing the strings as UTF-8? It'll depend on how it does the conversion. – Harry Johnston Nov 22 '15 at 21:38
  • @roeland, 'mbcs' is generally incorrect, since the console defaults to the OEM codepage, not the ANSI codepage. Use `sys.stdout.encoding`. – Eryk Sun Nov 23 '15 at 01:22
  • @eryksun you're right. `sys.stdout.encoding` will work. – roeland Nov 23 '15 at 03:13
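
The filtering hack from the last few comments can be sketched as a small helper. This is only an illustration under stated assumptions: `console_safe` is a hypothetical name, and the fallback to ASCII when `sys.stdout.encoding` is unavailable is my own choice, not something suggested in the thread.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Assumption: fall back to ASCII if stdout has no usable encoding attribute.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Round-trip through the console encoding; unencodable characters become '?'.
    return text.encode(encoding, errors='replace').decode(encoding)

if __name__ == '__main__':
    names = ['simple.txt',
             'with a \u00ae symbol.txt',
             'with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt']
    for name in names:
        # Uses sys.stdout.encoding, so the result is always printable here.
        print(console_safe(name))
```

Note that distinct filenames can collapse to the same filtered text, so this suits quick display rather than unique identification.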

1 Answer


The following works fine, if you save the file in the declared encoding, and if you use an IDE or terminal encoding that supports the characters being displayed. Note that this does not have to be UTF-8. The declaration at the top of the file is the encoding of the source file only.

#coding:utf8
import os

# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.path.join(os.getcwd(),'temp')
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
print('file contents', file=open('temp/simple.txt','w'))
print('file contents', file=open('temp/with a ® symbol.txt','w'))
print('file contents', file=open('temp/with these chars ΣΑΠΦΩ.txt','w'))

# goal is to scan the files in a manner that allows for printing
# the filename as well as opening/reading the file ...
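for root,dirs,files in os.walk(tempfolder):
    for filename in files:
        # join with root (not tempfolder) so files in subfolders also resolve
        fullname = os.path.join(root, filename)
        print(fullname)
        # the with-block closes the file after reading
        with open(fullname,'r') as f:
            print(f.read())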

Output:

c:\temp\simple.txt
file contents

c:\temp\with a ® symbol.txt
file contents

c:\temp\with these chars ΣΑΠΦΩ.txt
file contents

If you use a terminal that does not support encoding the characters used in the filename, you will get a UnicodeEncodeError. Change:

print(fullname)

to:

print(ascii(fullname))

and you will see that the filename was read correctly, but just couldn't print one or more symbols in the terminal encoding:

'C:\\temp\\simple.txt'
file contents

'C:\\temp\\with a \xae symbol.txt'
file contents

'C:\\temp\\with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt'
file contents
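
As a self-contained variation on the approach above, here is a minimal sketch that keeps every path as `str` and always displays the `ascii()` form, so printing cannot fail regardless of the terminal encoding. The helper names `make_samples` and `scan_folder` are hypothetical, and using a temporary directory plus an explicit `encoding='utf-8'` for the file contents are assumptions made to keep the example independent of the machine's locale.

```python
import os
import tempfile

def make_samples(folder):
    """Create the three test files from the question inside folder."""
    names = ['simple.txt',
             'with a \u00ae symbol.txt',
             'with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt']
    for name in names:
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
            f.write('file contents\n')
    return names

def scan_folder(folder):
    """Return a list of (ascii_safe_name, contents) for each file under folder."""
    results = []
    # Passing a str (not bytes) to os.walk makes it return str names.
    for root, dirs, files in os.walk(folder):
        for filename in files:
            fullname = os.path.join(root, filename)
            with open(fullname, 'r', encoding='utf-8') as f:
                # ascii() escapes non-ASCII characters, so the name is always printable.
                results.append((ascii(fullname), f.read()))
    return results

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as folder:
        make_samples(folder)
        for shown, contents in scan_folder(folder):
            print(shown)
            print(contents)
```

The escaped form is unambiguous, so it still uniquely identifies each file even when the console cannot render the original characters.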
Mark Tolonen
  • Great answer but wouldn't utf-16 be more appropriate for Windows? – tdelaney Nov 22 '15 at 20:06
  • @tdelaney, the encoding declared is the encoding of the source file. It has nothing to do with the file system. Updated answer to make that more clear. – Mark Tolonen Nov 22 '15 at 20:24
  • Thanks @MarkTolonen - I just tried this, saving your example as UTF-8 from Notepad, but Python complains about "UTF-8 with BOM", so I'll need to figure out how to save it as UTF-8 without BOM and then I'll reply and mark this answered if that works. Out of time, need to be away from the keyboard for a couple hours. – Doug Mahugh Nov 22 '15 at 20:52
  • @MarkTolonen - I was thinking about the need for a utf-8 terminal. I naively assumed that python3 stdout encoding would be utf-16 since the console is native wide char. Turns out that it is still codepage based. OP could do, for instance, `print(fullname.encode(sys.stdout.encoding, 'replace'))` on a regular windows console. Characters not in the code page would display as "?" but it would otherwise be harmless. And no need to restrict the execution environment of the script. – tdelaney Nov 22 '15 at 21:44
  • Thanks, @tdelaney! That works great at the command prompt, just what I wanted. – Doug Mahugh Nov 23 '15 at 00:03
  • @tdelaney, Unicode in the console (issue 1602) is an old unresolved issue. `FileIO` is based on bytes and POSIX `read` and `write`. It's not appropriate for the console. It needs a `RawIOBase` subclass that calls `ReadConsoleW` and `WriteConsoleW`. [win-unicode-console](https://github.com/Drekin/win-unicode-console) is an example. It needs better integration with the REPL and tokenizer, and to provide the APIs (maybe in `_winapi`) and hooks necessary to implement a readline module without ctypes. – Eryk Sun Nov 23 '15 at 01:18
  • @DougMahugh, if you save source with `UTF-8 with BOM`, use `#coding:utf-8-sig` instead, or just leave it out, as without an encoding declared, UTF-8 (with or without BOM) is the default on Python 3. I use it because I use a Python editor that saves the source automatically in whatever encoding I declare, but it defaults (incorrectly) to "ANSI" (`Windows-1252` on US Windows) if left out. – Mark Tolonen Nov 23 '15 at 09:41
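
For anyone unsure whether their editor wrote a BOM, the check discussed in the comments can be sketched as follows. The function names `starts_with_utf8_bom` and `read_source` are hypothetical, and the sample file written here is only a stand-in for a real script.

```python
import codecs
import os
import tempfile

def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 signature bytes EF BB BF."""
    with open(path, 'rb') as f:
        return f.read(3) == codecs.BOM_UTF8

def read_source(path):
    """Read text as UTF-8; the 'utf-8-sig' codec drops a leading BOM if present."""
    with open(path, 'r', encoding='utf-8-sig') as f:
        return f.read()

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, 'myscript.py')
        # Simulate what Notepad's "UTF-8 with BOM" option writes.
        with open(path, 'wb') as f:
            f.write(codecs.BOM_UTF8 + b'import os\n')
        print(starts_with_utf8_bom(path))
        print(ascii(read_source(path)))
```

Reading with `utf-8-sig` works for files with or without the signature, which is why it is a convenient default when the source of the file is unknown.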