I'm writing a file comparison function. I know of filecmp.cmp, but in my dataset a lot of the files are expected to be identical, so rather than comparing each potential match pair by pair I thought it would be better to implement a multi-file comparison that can compare them all at once. (Also, since I'm new to Python, I thought it was a good learning exercise.) It seems to be going OK so far; in fact, with some inputs it is faster than Unix's cmp
(which actually has me a little worried, because I don't quite believe that's possible and therefore think there may be something wrong with my implementation!)
So, I have the code written, but I'm now trying to determine what the ideal chunk size for each read would be. Part of me thinks that since all the data retrieved has to be compared anyway, the more I can get into memory at one time the better, but I wonder whether there are limitations of the Python data structures that might influence this beyond that. For example, I'm maintaining potentially large lists of chunks and using dictionaries where the keys are the read chunks (a rough sketch of the grouping idea is below).
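To make that concrete, the core of my approach is roughly along these lines. This is a simplified sketch rather than my actual code (the names are made up, and the real version handles errors and the threading mentioned later):

    import collections

    def group_identical(paths, chunk_size=64 * 1024):
        """Split paths into groups whose file contents are byte-for-byte identical."""
        handles = {p: open(p, 'rb') for p in paths}
        groups = [list(paths)]      # start with every file in one group
        finished = []
        while groups:
            next_groups = []
            for group in groups:
                # Read the next chunk of each file and bucket files by chunk content.
                by_chunk = collections.defaultdict(list)
                for p in group:
                    by_chunk[handles[p].read(chunk_size)].append(p)
                for chunk, members in by_chunk.items():
                    if chunk == b'' or len(members) == 1:
                        # EOF reached (members are identical) or the file is unique.
                        finished.append(members)
                    else:
                        next_groups.append(members)
            groups = next_groups
        for f in handles.values():
            f.close()
        return finished

Note that this opens every file at once, so OS file-descriptor limits would matter for very large sets; the point here is just the dictionary keyed by chunks that splits groups as soon as their contents diverge.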
So, what should I be aware of in Python's built-in data structures that may affect this, or is this something that depends only on the hardware and should be found by profiling on a particular machine?
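By profiling I mean something as rough as the following; the file path is just a placeholder, and OS caching would obviously skew repeated runs:

    import time

    def read_time(path, chunk_size, repeats=3):
        """Best wall-clock time to read the whole file at a given chunk size."""
        best = float('inf')
        for _ in range(repeats):
            start = time.time()
            with open(path, 'rb') as f:
                while f.read(chunk_size):
                    pass
            best = min(best, time.time() - start)
        return best

    # Hypothetical usage on a representative file from the dataset:
    # for size in (4 * 1024, 64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    #     print(size, read_time('/path/to/sample/file', size))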
Reading that back, I realise it's not the clearest question, but (despite attempts) I'm not sure how to clarify it. I'm happy to post my code if that will make things clearer, but it's a bit longer than your average code sample (not too bad, though). Please comment if further clarification is needed.
Thanks.
Update re. SHA1: I have tested my algorithm against a SHA1-based comparison on just two identical input files (more duplicates are expected in the real data), running each approach 100 times. I realise this is not a thorough test, but the results are different enough to be worth commenting on.
(The computer wasn't under any other load during either test, and despite what I said in the comments, this wasn't running on a target machine; it was running on one with quite reasonable specs. Both tests had the potential to run in two threads; that is, the SHA1 version ran in two threads, and two threads were started for mine, but due to the implementation only one would have been used. The single-threaded SHA1 version took much longer. Both tests read the same chunk size at a time. Three sets of results are given below, and a sketch of the SHA1 version I used follows the numbers.)
Now I'm confused. Are the comments (re. SHA1) right? If so, is this indicative of a faulty implementation, or is something else going on?
SHA1:

    real    5m35.865s    6m17.737s    5m57.010s
    user   10m18.963s   11m34.178s   10m58.760s
    sys     0m47.030s    0m52.707s    0m47.807s

Mine:

    real    3m47.185s    4m31.548s    4m40.628s
    user    2m47.849s    3m26.207s    3m36.013s
    sys     0m59.193s    1m5.139s     1m4.406s
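For reference, the SHA1 version I benchmarked against is essentially the standard chunked-hash approach, roughly like this (a single-threaded sketch; the actual test also used two threads as described above):

    import hashlib

    def sha1_of_file(path, chunk_size=64 * 1024):
        """Hash a file in chunks so the whole file never sits in memory."""
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

    def group_by_sha1(paths, chunk_size=64 * 1024):
        """Treat files with the same digest as identical."""
        groups = {}
        for p in paths:
            groups.setdefault(sha1_of_file(p, chunk_size), []).append(p)
        return list(groups.values())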