
I'm using Python 2.7.3 on Ubuntu 12 x64.

I have about 200,000 files in a folder on my filesystem. Some of the file names contain URL-encoded (percent-escaped) and HTML-escaped characters because the files were originally downloaded from a website. Here are examples:

Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg

I wrote a simple Python script that goes through the folder and renames all of the files with encoded characters in the filename. The new filename is achieved by simply decoding the string that makes up the filename.

The script works for most of the files, but for some of them Python chokes and spits out the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Traceback (most recent call last):
  File "./download.py", line 53, in downloadGalleries
    numDownloaded = downloadGallery(opener, galleryLink)
  File "./download.py", line 75, in downloadGallery
    filePathPrefix = getFilePath(content)
  File "./download.py", line 90, in getFilePath
    return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip())
  File "/home/abc/XYZ/common.py", line 22, in cleanupString
    return HTMLParser.HTMLParser().unescape(string)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)

Here are the contents of my cleanupString function:

def cleanupString(string):
    string = urllib2.unquote(string)

    return HTMLParser.HTMLParser().unescape(string)

And here's the snippet of code that calls the cleanupString function (this code is not the same code in the traceback above but it produces the same error):

rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []

for root, dirs, files in os.walk(rootFolder):
    for filename in files:
        foundFile = os.path.join(root, filename)

        if reobj.match(foundFile):
            imgs.append(foundFile)

for img in imgs:
    print 'Checking file: ' + img
    newImg = cleanupString(img)  # Code blows up here for some files

Can anyone provide me with a way to get around this error? I've already tried adding

# -*- coding: utf-8 -*-

to the top of the script but that has no effect.

Thanks.

Justin Kredible

2 Answers

6

Your filenames are byte strings that contain UTF-8 bytes representing Unicode characters. The HTML parser normally works with unicode data rather than byte strings, particularly when it encounters an ampersand escape, so Python automatically tries to decode the value for you, and by default it uses ASCII for that decoding. This fails for UTF-8 data because it contains bytes that fall outside the ASCII range.
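A minimal sketch of that failure, using a made-up filename (an assumption, not one of the files from the question). Python 2 performs the ASCII decode implicitly when a byte string meets a unicode string; here the decode is written out explicitly so the same lines also run on Python 3:

```python
# Hypothetical unquoted filename: the bytes \xe2\x80\x99 are the UTF-8
# encoding of a right single quotation mark (U+2019).
raw = b'Ray\xe2\x80\x99s trip.jpg'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes do not fit the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# This is the decode Python 2 attempts behind the scenes; byte 0xe2 is
# outside the ASCII range, so it fails.
ascii_result = try_decode(raw, 'ascii')

# Decoding with the encoding the bytes were actually written in succeeds.
utf8_result = try_decode(raw, 'utf8')
```

Here `ascii_result` ends up as `None` while `utf8_result` holds the decoded text, which is why naming the encoding explicitly avoids the error.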

You need to explicitly decode your string to a unicode object:

def cleanupString(string):
    # Undo the percent-escapes, then decode the resulting UTF-8 bytes
    string = urllib2.unquote(string).decode('utf8')

    return HTMLParser.HTMLParser().unescape(string)

Your next problem will be that you now have Unicode filenames, but your filesystem needs an encoding to work with them. You can check what that encoding is with sys.getfilesystemencoding(); use it to re-encode your filenames:

def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')

    # Re-encode the unescaped text so it can be used as a filename on disk
    return HTMLParser.HTMLParser().unescape(string).encode(sys.getfilesystemencoding())
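The two steps can be tried out on the example names from the question with a sketch like the following. The names `cleanup_name` and `to_filesystem_bytes` are hypothetical, and the import fallback is an assumption added so the block runs on Python 3 as well, where the equivalents are `urllib.parse.unquote_to_bytes` and `html.unescape`:

```python
import sys

try:  # Python 2.7, as in the question
    from urllib2 import unquote as unquote_to_bytes
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape
except ImportError:  # Python 3.4+
    from urllib.parse import unquote_to_bytes
    from html import unescape


def cleanup_name(raw_name):
    """Percent-decode, decode the bytes as UTF-8, then unescape HTML entities."""
    text = unquote_to_bytes(raw_name).decode('utf8')
    return unescape(text)


def to_filesystem_bytes(text):
    """Encode decoded text with the encoding the filesystem reports."""
    return text.encode(sys.getfilesystemencoding())


# Example names taken from the question
simple = cleanup_name('Jamaica%2008%20114.jpg')
decoded = cleanup_name(
    'thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg')
```

Keeping the decode and the re-encode in separate functions makes it easier to inspect the intermediate text value when a particular file misbehaves.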

You can read up on how Python deals with Unicode in the Unicode HOWTO.

Martijn Pieters
  • You need to be wary when working with filenames on Linux. There is no set character encoding that is used, and even if one is explicitly configured (usually UTF-8), you can still get filenames that don't conform to it. Generally speaking, you need to treat them as raw byte strings, or at least not fall over if you get an invalid name. – spencercw Sep 27 '12 at 16:51
  • Come to think of it, his encoded filenames are not even necessarily UTF-8, which could make things interesting. – spencercw Sep 27 '12 at 16:54
  • @spencercw: The example he gave is. The `\xe2` byte in his error message is another clue; that's a typical UTF-8 lead byte. – Martijn Pieters Sep 27 '12 at 16:56
  • This solution works on the test files I created to reproduce the problem. Now I'll incorporate the solution into my main script. Thanks. – Justin Kredible Sep 27 '12 at 17:46
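The caution raised in the comments above could be handled along these lines. This is only a sketch: `plan_renames` is a hypothetical helper, it keeps the names as raw bytes, it skips the HTML unescape step, and it assumes that a name whose unquoted bytes are not valid UTF-8 should be left alone rather than renamed:

```python
try:  # Python 2.7, as in the question
    from urllib2 import unquote as unquote_to_bytes
except ImportError:  # Python 3
    from urllib.parse import unquote_to_bytes


def plan_renames(raw_names):
    """Return (renames, skipped) for an iterable of byte-string filenames.

    renames holds (old, new) pairs; skipped holds names whose unquoted
    bytes are not valid UTF-8 and are therefore left untouched.
    """
    renames, skipped = [], []
    for raw in raw_names:
        unquoted = unquote_to_bytes(raw)
        try:
            unquoted.decode('utf8')  # validation only, result is discarded
        except UnicodeDecodeError:
            skipped.append(raw)
            continue
        if unquoted != raw:
            renames.append((raw, unquoted))
    return renames, skipped


# The last name is made up: %E9 is Latin-1 for an accented e, not valid UTF-8.
renames, skipped = plan_renames(
    [b'Jamaica%2008%20114.jpg', b'plain.jpg', b'caf%E9.jpg'])
```

Collecting the skipped names instead of raising lets a run over a large folder finish, and the leftovers can be inspected by hand afterwards.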
0

Looks like you're bumping into this issue. I would try reversing the order in which you call unescape and unquote, since unquote would be adding non-ASCII characters to your filenames, although that may not fix the problem.

What is the actual filename it is choking on?

spencercw