How to check if a large file has changed?

Question

I'm running a script that checks whether new files are available or existing files have changed.

root/
├── Sub1
│   ├── Sub1.iso
│   └── Sub1.txt
├── Sub2
│   ├── Sub2.iso
│   └── Sub2.txt
└── Sub3
    └── Sub3.iso

When a file is new, item.txt will be created.
When a file has changed, item.txt shall be recreated.

Created and modified timestamps are not reliable enough, since a file could be copied and pasted (or similar) and still be the same file.

My idea would be an MD5 hash. But the files could potentially be up to 50 GB each, so hashing them for every comparison would take way too much time.

The usual workflow would be:

loop over all subfolders of root
compare the size and, when necessary, the hash of the .iso with an existing database entry
create a .txt if the file is new / updated
save / update the hash in a database: filename | hash
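
The workflow above could look roughly like the following. This is only a sketch under assumptions that are not in the question: the "database" is a JSON file, the marker is written next to the .iso with the same base name, and the names scan, md5_of_file and db_path are made up for illustration. It still hashes every file in full, so it shows the structure rather than solving the speed problem; the fingerprint argument is where a cheaper check could be plugged in.

```python
import hashlib
import json
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Full-file MD5, read in chunks so memory use stays small."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root, db_path, fingerprint=md5_of_file):
    """Return the relative paths of .iso files that are new or changed."""
    db = {}
    if os.path.exists(db_path):
        with open(db_path) as handle:
            db = json.load(handle)

    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".iso"):
                continue
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root)
            current = {"size": os.path.getsize(path),
                       "hash": fingerprint(path)}
            if db.get(key) == current:
                continue  # same size and same hash as on the last run
            # New or changed: (re)create the marker file (assumed naming).
            marker = os.path.splitext(path)[0] + ".txt"
            with open(marker, "w") as handle:
                handle.write(current["hash"] + "\n")
            db[key] = current
            changed.append(key)

    with open(db_path, "w") as handle:
        json.dump(db, handle)
    return changed
```

Keeping db_path outside of root avoids mixing the database with the scanned files.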

Okay, since a hash of the complete .iso would take too much time and timestamps are not reliable enough:

What other approaches are there to check whether a file has been changed / updated?

Notes: It has to be OS-independent and should work in Python 2.7.
I thought about just reading the first 100 blocks or something like that.
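
To make the "first 100 blocks" idea concrete, here is a minimal sketch. The block size of 4096 bytes and the name head_fingerprint are assumptions of mine, and mixing the file size into the hash is an addition so that truncated or appended files are noticed. It cannot see a change that leaves the size and the first blocks untouched.

```python
import hashlib
import os

BLOCK_SIZE = 4096   # assumed block size, not taken from the question
BLOCK_COUNT = 100   # "the first 100 blocks"


def head_fingerprint(path, block_size=BLOCK_SIZE, block_count=BLOCK_COUNT):
    """MD5 over the file size plus the first block_count blocks."""
    digest = hashlib.md5()
    # Include the size so that growth or truncation changes the result.
    digest.update(str(os.path.getsize(path)).encode("ascii"))
    with open(path, "rb") as handle:
        for _ in range(block_count):
            block = handle.read(block_size)
            if not block:
                break  # file is shorter than block_count blocks
            digest.update(block)
    return digest.hexdigest()
```

With these defaults at most about 400 KB are read per file, regardless of how large the .iso is.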

Use sets: check set.difference for new or deleted files, and check the size for file changes — Padraic Cunningham, Jun 08 '14 at 17:31
File size and sets (implicit) are already mentioned in the question — boop, Jun 08 '14 at 17:32
Have you timed how long it takes to compute the hash of one of the large files? If not, don't just assume that it would take too long. — Tim, Jun 08 '14 at 17:35
Without making any assumptions, I don't think there's a universal way to detect file changes other than reading every byte of it. Imagine your hard disk fails and a single byte gets corrupted. — Pavel, Jun 08 '14 at 17:37
Store the file names in a set and pickle it. Load the pickled object on the next iteration and compare pickled_set.difference(new_set). Any name changes or deleted files can be found using this logic. Sets are efficient for lookups. — Padraic Cunningham, Jun 08 '14 at 17:38
If you only need to implement this for a specific file type, e.g. `iso`, you can check if there is a header so that you can check only a small portion of the file for changes. Although this doesn't guarantee you that there's not a different iso file with the same header but with different content. — Pavel, Jun 08 '14 at 17:39
Anyway, consider using CRC32 instead of MD5, hash a fixed part of the file (e.g. first and last 100M) and read this: http://stackoverflow.com/questions/1177607/what-is-the-fastest-way-to-create-a-checksum-for-large-files-in-c-sharp — Pavel, Jun 08 '14 at 17:41
@Tim yep, too long. @Pavel I like the header idea, this should be good enough actually. — boop, Jun 08 '14 at 17:45
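
Two of the suggestions from the comments, sketched together: a CRC32 over only the first and last part of a file, and a set difference on file names to spot new or deleted entries. The function names and the 1 MB read size are illustrative assumptions; the 100 MB span follows the comment. As noted in the comments, a change in the unread middle of a file with an unchanged size goes undetected.

```python
import os
import zlib


def crc32_head_tail(path, span=100 * 1024 * 1024, chunk_size=1024 * 1024):
    """Return (size, crc32) computed over the first and last span bytes."""
    size = os.path.getsize(path)
    head_end = min(span, size)
    # Do not read the same bytes twice when the file is shorter than 2 * span.
    tail_start = max(size - span, head_end)
    crc = 0
    with open(path, "rb") as handle:
        for start, end in ((0, head_end), (tail_start, size)):
            handle.seek(start)
            remaining = end - start
            while remaining > 0:
                chunk = handle.read(min(chunk_size, remaining))
                if not chunk:
                    break
                crc = zlib.crc32(chunk, crc)
                remaining -= len(chunk)
    # Mask so Python 2 and Python 3 give the same unsigned value.
    return size, crc & 0xFFFFFFFF


def diff_names(previous, current):
    """Return (added, removed) file names between two runs."""
    previous, current = set(previous), set(current)
    return sorted(current - previous), sorted(previous - current)
```

The (size, crc32) pair can be stored per file name and compared on the next run in place of a full hash.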
