
I'm writing a file comparison function. I know of filecmp.cmp, but in my dataset a lot of the files are expected to be identical, so rather than comparing each potential match pairwise I thought it would be better to implement a multi-file comparison that can compare them all at once. (Also, since I'm new to Python, I thought it was a good learning exercise.) It seems to be going OK so far; in fact, with some inputs it is faster than Unix's cmp (which actually has me a little worried, because I don't quite believe that's possible and therefore think there may be something wrong with my implementation!)

So, I have the code written, but I'm now trying to determine what the ideal chunk size for each read would be. Part of me thinks that since all the data retrieved will have to be compared anyway, the more I can get into memory at one time the better, but I wonder whether there are limitations of Python's data structures that might influence this beyond that. For example, I'm maintaining potentially large lists of chunks and using dictionaries where the keys are the read chunks.
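
To give a rough idea of the structure (a heavily simplified sketch, not my actual code; the names are made up for illustration): the files still considered potential matches are read a chunk at a time, and a dictionary keyed by the chunk splits each group into smaller groups.

    from collections import defaultdict

    def group_identical(paths, chunk_size=1024 * 1024):
        """Return lists of paths whose contents are byte-for-byte identical."""
        results = []
        # Every file starts out in one candidate group.
        groups = [[open(p, 'rb') for p in paths]]
        while groups:
            group = groups.pop()
            # Read one chunk from each file and bucket the files by that chunk.
            buckets = defaultdict(list)
            for f in group:
                buckets[f.read(chunk_size)].append(f)
            for chunk, files in buckets.items():
                if len(files) == 1:
                    files[0].close()        # no possible match left: bail out early
                elif chunk == b'':          # all hit EOF together: identical contents
                    results.append([f.name for f in files])
                    for f in files:
                        f.close()
                else:
                    groups.append(files)    # still matching: compare the next chunk
        return results

So the dictionary keys are chunks of up to the read size, and the lists of still-matching file objects are the potentially large lists I mentioned.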

So, what should I be aware of in Python's built-in data structures that may affect this, or is this something that depends only on the hardware and is best determined by profiling on a particular machine?


Reading that back, I realise it's not the clearest question, but (despite attempts) I'm not sure how to clarify it. I'm happy to post my code if that will make things clearer, but it's a bit longer than your average code sample (not too bad, though). Please comment if further clarification is needed.

Thanks.


Update re. SHA1: I have tested my algorithm against a SHA1-based comparison on just two identical input files (more are expected in the real data), running each 100 times. I realise this is not a thorough test, but the results are different enough to be worth commenting on.

(The computer wasn't under any other load during either test, and despite what I said in the comments, this wasn't run on a target machine; it was run on one with quite reasonable specs. Both tests had the potential to run in two threads; that is, the SHA1 version ran in two threads, and two threads were started for mine, but due to the implementation only one would have been used. The single-threaded SHA1 version took much longer. Both tests read the same size of chunk at a time. Three sets of results are given.)

Now I'm confused. Are the comments (re. SHA1) right? Is this therefore indicative of a faulty implementation, or is something else going on?

SHA1:

real    5m35.865s    6m17.737s    5m57.010s
user    10m18.963s   11m34.178s   10m58.760s
sys     0m47.030s    0m52.707s    0m47.807s

Mine:

real    3m47.185s    4m31.548s    4m40.628s
user    2m47.849s    3m26.207s    3m36.013s
sys     0m59.193s    1m5.139s     1m4.406s
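
(For reference, the SHA1 side amounts to the standard chunked read-and-hash pattern; a simplified, single-threaded sketch is below, with the real test using two threads and the same chunk size as my code.)

    import hashlib

    def sha1_of_file(path, chunk_size=1024 * 1024):
        """Hash a file in chunks so it never has to be held in memory at once."""
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    def group_by_sha1(paths):
        """Bucket paths by digest; equal digests are treated as equal files."""
        groups = {}
        for p in paths:
            groups.setdefault(sha1_of_file(p), []).append(p)
        return list(groups.values())
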
tjm
  • Do you only want to know which files are equal? Or do you need detailed information about the differences? If it is the former, just compare the SHA1 hashes of the files, and you are done. – Sven Marnach Nov 01 '11 at 20:46
  • Or another checksum if SHA1 is too slow and you don't need cryptographic security. – agf Nov 01 '11 at 20:52
  • @SvenMarnach, All I need is to know if they're equal, but there are likely to be large numbers of big identical files, and I thought doing it this way I could avoid the overhead of calculating the hash for each file (and the, highly unlikely, possibility of a hash collision!) – tjm Nov 01 '11 at 20:53
  • @agf: On my machine, computing a SHA1 is at least a hundred times faster than reading from disc, so this will never become the bottleneck. – Sven Marnach Nov 01 '11 at 21:00
  • @tjm: The cost of computing the SHA1 of a file is completely negligible compared to the cost of reading it from disk. And SHA1 hashes are 160 bits. You will *never* encounter a hash collision. – Sven Marnach Nov 01 '11 at 21:03
  • @SvenMarnach But _why not_ use a computationally cheaper checksum? There is no higher cost in time, code, etc. There _is_ a CPU vs. security tradeoff, and there are conceivable situations where your CPU quota was tighter than your disk quota. – agf Nov 01 '11 at 21:05
  • @SvenMarnach Yeah, the hash collision bit was a bit of a joke :). I'm just setting up a test to see what the cost of the SHA1 vs the read is here, out of interest. I expect you're right that it will be negligible, but the machines I will be running on do have fast disks, though apart from that they are, well, quite crap! But, assuming I do stick with the code I have, I'm still interested in the answer. – tjm Nov 01 '11 at 21:11
  • @agf Again with the non-saving of resources whenever sha1 comes up. See also: http://stackoverflow.com/questions/6963610/how-in-python-check-if-two-files-string-and-file-have-same-content I don't understand this bee in your bonnet. "But why not use a computationally cheaper checksum?" Why not use a better hash when your user is not going to be waiting? Try this: run a sha1 implementation and a hash of your choosing. See if you can tell them apart by time used. – hughdbrown Nov 02 '11 at 01:08
  • @SvenMarnach. Please see my update re. SHA1 tests. I'd appreciate your feedback. – tjm Nov 02 '11 at 03:55
  • @tjm: You're on linux, somebody probably already wrote a small program that helps you ;-) – Jochen Ritzel Nov 02 '11 at 04:00
  • @SvenMarnach. (In addition to my first comment) Also, I forgot to mention, that not taking a hash also allows me to bail out early on files that don't match. – tjm Nov 02 '11 at 04:25
  • @SvenMarnach - _"You will never encounter a hash collision."_ - Clearly you've never researched the Merkle-Damgård construction. The same construction is used for MD5, SHA1 and the SHA2 family. Since MD5 is broken with some serious collision vulnerabilities, it casts serious doubt on SHA1's ability to remain collision-free. Whilst this point may seem moot for such a program, one must keep in mind that a break in SHA1 (which is imminent - i.e. within the next 5 to 10 years) would break his functionality. I'd suggest using RIPEMD160 if possible, since it doesn't rely on the same construction. – Polynomial Nov 02 '11 at 06:52
  • @Polynomial: I'm well aware of everything you said. As I understand the OP's problem, she tries to compare some real-world files, not files that are specially crafted to produce a hash collision. Even if a collision is found for SHA1, the odds of encountering one by pure chance are still zero (well, actually 7e-49, but that's zero in the given context). – Sven Marnach Nov 02 '11 at 11:47

1 Answer


I suggest you use a binary search methodology to select the size value.

Start with a large value (one that you know to be too large) and halve it. If the run gets faster, halve it again; if it gets slower, the optimum lies between the last two values you tried, so narrow the search to that interval. Continue until you converge on the best value.
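
A rough sketch of that search, assuming the timing curve has a single best point; time_run is a placeholder for however you time one full comparison run at a given chunk size:

    def tune_chunk_size(time_run, start=128 * 1024 * 1024, min_size=4 * 1024):
        """Start too large and halve the chunk size while runs keep getting faster."""
        best_size, best_time = start, time_run(start)
        size = start // 2
        while size >= min_size:
            t = time_run(size)
            if t >= best_time:
                break                   # slower again: the previous size wins
            best_size, best_time = size, t
            size //= 2
        return best_size

Once the halving stops improving you can bisect between the last two sizes tried (the narrowing step described above), but the nearest power of two is usually close enough.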

Polynomial