I've written a basic program that walks a directory tree containing a large number of JPEG files (500,000+), verifies that each one is not corrupted (approximately 3-5% of the files seem to be corrupt in some way), then takes a SHA-1 hash of every file (even the corrupt ones) and saves the result to a database.
The JPEG files are stored on a Windows system and mounted on the Linux box via CIFS. Most are around 4 MB in size, although some may be slightly larger or smaller.
When I run the program it works fairly well for a while and then falls over with the error below. This happened after it had processed approximately 1,100 files (the error indicates that the problem occurred while attempting to open a 4.5 MB file).
I understand that I can catch this error and continue or retry, but I'm curious why it occurs in the first place, and whether catching and retrying will actually solve the problem or whether it will just get stuck retrying (unless I limit the retries, of course, but then a file gets skipped).
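To make the "limit the retries" option concrete, here is a minimal sketch of a bounded retry around the open-and-hash step. The helper name `sha1_with_retries`, the `max_attempts` and `delay` values, and the injectable `opener` parameter are all hypothetical and not part of my script; it only illustrates the workaround and says nothing about why the error happens.

```python
import errno
import hashlib
import time


def sha1_with_retries(filepath, max_attempts=3, delay=0.5, opener=open):
    """Return the SHA-1 hex digest of filepath, retrying on ENOMEM.

    Hypothetical helper: the name, the retry limit and the delay are
    illustrative choices, not taken from the original script.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # Start a fresh hash on every attempt so a read that fails
            # part-way through cannot leave stale data in the digest.
            sha1 = hashlib.sha1()
            with opener(filepath, "rb") as f:
                # Read in 8 KiB chunks so the whole file is never in memory.
                for chunk in iter(lambda: f.read(8192), b""):
                    sha1.update(chunk)
            return sha1.hexdigest()
        except (IOError, OSError) as e:
            # Retry only "Cannot allocate memory" (errno 12); re-raise any
            # other error, and give up once the attempt budget is spent.
            if e.errno != errno.ENOMEM or attempt == max_attempts:
                raise
            time.sleep(delay)
```

With this shape the script cannot loop forever: after `max_attempts` failures the original `IOError` propagates, and the caller can record the path for a later pass instead of silently skipping it.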
I'm using Python 2.7.5+ on a Debian system. The system has at least 4 GB (possibly 8 GB) of RAM, and top reports that the script uses less than 1% of the RAM and less than 3% of the CPU at any time while it is running. jpeginfo, which the script runs, uses similarly small amounts of memory and CPU.
To avoid using too much memory when reading files in I have taken the approach given in this answer to another question: https://stackoverflow.com/a/1131255/289545
You may also notice that the "jpeginfo" command is inside a while loop that waits for an "[OK]" response. This is because when "jpeginfo" thinks it can't find the file it still exits with status 0, so the subprocess.check_output call does not treat it as an error.
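For comparison, the same polling could be given an upper bound so that a file which never reports "[OK]" cannot stall the run. This is only a sketch: `check_jpeg`, `max_attempts`, `delay` and the injectable `run` parameter are hypothetical names, and it assumes, as in my script, that `jpeginfo -c` prints "[OK]" for a good file and that a non-zero exit raises `subprocess.CalledProcessError`.

```python
import subprocess
import time


def check_jpeg(filepath, max_attempts=5, delay=0.5, run=subprocess.check_output):
    """Run `jpeginfo -c` until it reports [OK] or the attempts run out.

    Returns (ok, last_output). Hypothetical helper; a non-zero exit from
    jpeginfo still raises subprocess.CalledProcessError to the caller.
    """
    output = ""
    for attempt in range(max_attempts):
        raw = run(["jpeginfo", "-c", filepath])
        # check_output returns bytes; decode so the substring test is on text.
        output = raw.decode("utf-8", "replace").strip()
        if "[OK]" in output:
            return True, output
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return False, output
```

A caller could treat `(False, output)` as "not found after several tries" and log the path, which keeps that case separate from files that jpeginfo actually flags as corrupt.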
I did wonder whether the fact that jpeginfo fails to find certain files on the first try could be related (and I suspect it is), but the error says "Cannot allocate memory" rather than "file not found".
The Error:
Traceback (most recent call last):
  File "/home/m3z/jpeg_tester", line 95, in <module>
    main()
  File "/home/m3z/jpeg_tester", line 32, in __init__
    self.recurse(self.args.dir, self.scan)
  File "/home/m3z/jpeg_tester", line 87, in recurse
    cmd(os.path.join(root, name))
  File "/home/m3z/jpeg_tester", line 69, in scan
    with open(filepath) as f:
IOError: [Errno 12] Cannot allocate memory: '/path/to/file name.jpg'
The full program code:
1 #!/usr/bin/env python
2
3 import os
4 import time
5 import subprocess
6 import argparse
7 import hashlib
8 import oursql as sql
9
10
11
12 class main:
13     def __init__(self):
14         parser = argparse.ArgumentParser(description='Check jpeg files in a given directory for errors')
15         parser.add_argument('dir',action='store', help="absolute path to the directory to check")
16         parser.add_argument('-r, --recurse', dest="recurse", action='store_true', help="should we check subdirectories")
17         parser.add_argument('-s, --scan', dest="scan", action='store_true', help="initiate scan?")
18         parser.add_argument('-i, --index', dest="index", action='store_true', help="should we index the files?")
19
20         self.args = parser.parse_args()
21         self.results = []
22
23         if not self.args.dir.startswith("/"):
24             print "dir must be absolute"
25             quit()
26
27         if self.args.index:
28             self.db = sql.connect(host="localhost",user="...",passwd="...",db="fileindex")
29             self.cursor = self.db.cursor()
30
31         if self.args.recurse:
32             self.recurse(self.args.dir, self.scan)
33         else:
34             self.scan(self.args.dir)
35
36         if self.db:
37             self.db.close()
38
39         for line in self.results:
40             print line
41
42
43
44     def scan(self, dirpath):
45         print "Scanning %s" % (dirpath)
46         filelist = os.listdir(dirpath)
47         filelist.sort()
48         total = len(filelist)
49         index = 0
50         for filen in filelist:
51             if filen.lower().endswith(".jpg") or filen.lower().endswith(".jpeg"):
52                 filepath = os.path.join(dirpath, filen)
53                 index = index+1
54                 if self.args.scan:
55                     try:
56                         procresult = subprocess.check_output(['jpeginfo','-c',filepath]).strip()
57                         while "[OK]" not in procresult:
58                             time.sleep(0.5)
59                             print "\tRetrying %s" % (filepath)
60                             procresult = subprocess.check_output(['jpeginfo','-c',filepath]).strip()
61                         print "%s/%s: %s" % ('{:>5}'.format(str(index)),total,procresult)
62                     except subprocess.CalledProcessError, e:
63                         os.renames(filepath, os.path.join(dirpath, "dodgy",filen))
64                         filepath = os.path.join(dirpath, "dodgy", filen)
65                         self.results.append("Trouble with: %s" % (filepath))
66                         print "%s/%s: %s" % ('{:>5}'.format(str(index)),total,e.output.strip())
67                 if self.args.index:
68                     sha1 = hashlib.sha1()
69                     with open(filepath) as f:
70                         while True:
71                             data = f.read(8192)
72                             if not data:
73                                 break
74                             sha1.update(data)
75                     sqlcmd = ("INSERT INTO `index` (`sha1`,`path`,`filename`) VALUES (?, ?, ?);", (buffer(sha1.digest()), dirpath, filen))
76                     self.cursor.execute(*sqlcmd)
77
78
79     def recurse(self, dirpath, cmd, on_files=False):
80         for root, dirs, files in os.walk(dirpath):
81             if on_files:
82                 for name in files:
83                     cmd(os.path.join(root, name))
84             else:
85                 cmd(root)
86                 for name in dirs:
87                     cmd(os.path.join(root, name))
88
89
90
91
92
93
94 if __name__ == "__main__":
95     main()