
I deal with large collections of unknown files, and have been learning Python to help me filter, sort and otherwise wrangle these files.

A collection I am looking at has a large number of resource forks, and I wrote a little script to find them and delete them (the next step is to find them and move them, but that's for another day).

I found that this collection contains a number of files with non-ASCII characters in the file name, and this seems to be tripping up the os.remove function.

Example file name: ._spec com report 395 (N.B. the 3 has a small dot underneath it, I can't find an example, or figure out how to show the hex of the filename...)

I log all the filenames; this is what the log records for that file: ._spec com report 3?95

The error I get is a WindowsError, as it can't find the file (the string it's passing is not the name by which Windows knows the file). I put in a try clause to let me work around it, but I'd really like to deal with it properly.

I also tried passing a Unicode string to os.walk, `os.walk(u'.')`, as per this post: Handling ascii char in python string (top answer), and I see the following error:

Traceback (most recent call last):
 File "<stdin>", line 3, in <module>
 File "c:\python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\uf022' in position 20: character maps to <undefined>

So I am guessing the answer lies in how the filename is parsed, and I'm wondering if anyone might be able to point me in the right direction...

code:

import os
import sys

# Raw string, so that "\t" is not read as a tab character
rootdir = r"c:\target Dir to walk"
destKeep = "Keepers.txt"
destDelete = "Deleted.txt"

matchingText = "._"
files_removed = 0
for folder, subs, files in os.walk(rootdir):
    outfileKeep = open(destKeep, "a")
    outfileDelete = open(destDelete, "a")
    for filename in files:
        matchScore = filename.find(matchingText)
        src = os.path.join(folder, filename)
        srcNewline = src + ", " + str(filename) + "\n"
        if matchScore == -1:
            # Not a resource fork: log it as a keeper
            outfileKeep.write(srcNewline)
        else:
            outfileDelete.write(srcNewline)
            try:
                os.remove(src)
                files_removed += 1
            except WindowsError:
                print "I was unable to delete this file:"
                outfileKeep.write(srcNewline)
    outfileKeep.close()
    outfileDelete.close()

# Report once, after the whole tree has been walked
if files_removed:
    print '%d files removed' % files_removed
else:
    print 'No files removed'
Jay

1 Answer


os.walk(u'.') is the normal way to get native-Unicode filenames and it should work fine; it does for me.

Your problem is here instead:

srcNewline = src + ", " + str(filename) + "\n"

str(filename) will use the default encoding to convert your Unicode string back down to bytes, and because that encoding doesn't have the character U+F022(*) you get a UnicodeEncodeError. You will have to choose which encoding you want to store in your output file, e.g. by doing srcNewline = (u'%s, %s\n' % (src, filename)).encode('utf-8'), or (perhaps better) by keeping your strings as Unicode and writing them to a file opened with codecs.open.

(*: which is a Private Use Area character that shouldn't be used, but not much you can do about that now I guess...)
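
A minimal sketch of the "keep your strings as Unicode" option, assuming Python 2.7 or 3 and using io.open (which, like codecs.open, encodes on write). The function name log_and_remove and its parameters are hypothetical and not from the question. Paths stay Unicode for os.remove and are only encoded to UTF-8 when they reach the log files:

```python
# -*- coding: utf-8 -*-
import io
import os


def log_and_remove(rootdir, marker, keep_log, delete_log):
    """Walk rootdir (pass a Unicode string), log every file, and delete
    the files whose name contains marker. Returns the number deleted.

    Hypothetical helper: the name and signature are illustrative only.
    """
    removed = 0
    # Text-mode files with an explicit encoding accept Unicode strings
    # and do the encoding themselves.
    keep = io.open(keep_log, "a", encoding="utf-8")
    deleted = io.open(delete_log, "a", encoding="utf-8")
    try:
        for folder, subs, files in os.walk(rootdir):
            for filename in files:
                src = os.path.join(folder, filename)
                # No str() here: the line stays Unicode until written.
                line = u"%s, %s\n" % (src, filename)
                if marker not in filename:
                    keep.write(line)
                    continue
                try:
                    os.remove(src)  # Unicode path, not an encoded byte string
                except OSError:     # WindowsError is a subclass of OSError
                    keep.write(line)
                else:
                    deleted.write(line)
                    removed += 1
    finally:
        keep.close()
        deleted.close()
    return removed
```

It could be called as log_and_remove(u'.', u'._', u'Keepers.txt', u'Deleted.txt'). If the log files live inside the tree being walked, they will show up in the keepers log themselves.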

bobince
  • Hey, thanks for the reply. I understand most of what you've said, and have been playing around with your suggestions. I still can't get it to work, but I think I am a little closer to understanding the problem... The issue seems to be how the OS layer deals with the filename, which is different from how MSDOS-based functions work on the filename. Essentially there is a 2-byte character in the filename (of unknown encoding) that Explorer can 'see' but that is stripped and masked by MSDOS. It seems the passing of this character is the issue; perhaps I should be looking at a bitstream, not a string? – Jay Oct 09 '11 at 22:37
  • I also found that the glyphs in question consist of an ASCII number followed by a hex code of EF 80 A2. I discovered this by viewing the folder in Firefox and viewing the source. What is interesting is that I can see numerical glyphs for a few numbers, each followed by the same code, suggesting a 4-byte word for the character. UTF-16? – Jay Oct 10 '11 at 02:41
  • The byte sequence `EF 80 A2` is the UTF-8 encoding of U+F022. That's how I would expect `.encode('utf-8')` to have worked when outputting to the Keepers.txt/Deleted.txt files, so that's fine. You shouldn't use the encoded-to-bytes string to actually access the file, though, because when you use byte strings for filenames you are limited to the system default code page (which won't have most Unicode characters in and certainly not the bogus U+F022 character). Keep filenames as Unicode, only encode them to bytes at the point you write them out to a byte stream. – bobince Oct 10 '11 at 09:15
  • Awesome, thank you. I did lots of background reading on the U+F022 glyph and diacritics in general. This all makes total sense. Thank you for your time, I appreciate it. :) – Jay Oct 10 '11 at 19:52
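
The points made in the comments can be checked from the interpreter. A small sketch, assuming the file name was a "3" followed by U+F022 as the Firefox page source suggested (the exact name is a reconstruction, not read from disk); can_encode is a hypothetical helper:

```python
# -*- coding: utf-8 -*-
# Assumed reconstruction of the problem file name: "3" followed by U+F022.
name = u"._spec com report 3\uf02295"

# EF 80 A2 is simply U+F022 encoded as UTF-8 (three bytes, not UTF-16).
utf8_bytes = u"\uf022".encode("utf-8")


def can_encode(text, encoding):
    """Return True if text can be encoded losslessly in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


in_utf8 = can_encode(name, "utf-8")    # True
in_cp850 = can_encode(name, "cp850")   # False: the code page has no U+F022

# With errors="replace" the character degrades to "?", which matches the
# "3?95" seen in the log and no longer names the real file.
lossy = name.encode("cp850", "replace")

# One way to make the hidden character visible when inspecting a name.
escaped = name.encode("unicode_escape")
```

Here utf8_bytes is b'\xef\x80\xa2', lossy is b'._spec com report 3?95', and escaped is b'._spec com report 3\\uf02295'.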