I'm using Python 2.7.3 on Ubuntu 12 x64.
I have about 200,000 files in a folder on my filesystem. The file names of some of the files contain html encoded and escaped characters because the files were originally downloaded from a website. Here are examples:
Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg
I wrote a simple Python script that goes through the folder and renames all of the files with encoded characters in the filename. The new filename is achieved by simply decoding the string that makes up the filename.
The script works for most of the files, but, for some of the files Python chokes and spits out the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Traceback (most recent call last):
File "./download.py", line 53, in downloadGalleries
numDownloaded = downloadGallery(opener, galleryLink)
File "./download.py", line 75, in downloadGallery
filePathPrefix = getFilePath(content)
File "./download.py", line 90, in getFilePath
return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip())
File "/home/abc/XYZ/common.py", line 22, in cleanupString
return HTMLParser.HTMLParser().unescape(string)
File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
Here is the contents of my cleanupString function:
def cleanupString(string):
string = urllib2.unquote(string)
return HTMLParser.HTMLParser().unescape(string)
And here's the snippet of code that calls the cleanupString function (this code is not the same code in the traceback above but it produces the same error):
rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []
for root, dirs, files in os.walk(rootFolder):
for filename in files:
foundFile = os.path.join(root, filename)
if reobj.match(foundFile):
imgs.append(foundFile)
for img in imgs :
print 'Checking file: ' + img
newImg = cleanupString(img) #Code blows up here for some files
Can anyone provide me with a way to get around this error? I've already tried adding
# -*- coding: utf-8 -*-
to the top of the script but that has no effect.
Thanks.