I'm running a script that checks whether new files are available or existing files have changed.

root/
├── Sub1
│   ├── Sub1.iso
│   └── Sub1.txt
├── Sub2
│   ├── Sub2.iso
│   └── Sub2.txt
└── Sub3
    └── Sub3.iso

When a file is new, item.txt will be created.
When a file has changed, item.txt shall be recreated.

Created and modified timestamps are not reliable enough, since a file could be copied and pasted (or similar) and still be the same file.

My idea would be an MD5 hash. But the files could potentially be up to 50 GB each, so hashing each one completely would take far too much time.

The usual workflow would be:

  • loop over all subfolders of root
  • compare the size and, when necessary, the hash of the .iso with an existing database entry
  • create a .txt if file is new / updated
  • save / update the hash in a database (filename | hash)
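
The workflow above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the function name `scan`, the JSON file used as the "database", and the pluggable `fingerprint` callable are all made up for illustration, and each .iso is assumed to sit directly inside a subfolder of root. The size is stored next to the hash, so the record is filename | size | hash. It only uses the standard library and should run on Python 2.7 as well as 3.

```python
import json
import os


def scan(root, db_path, fingerprint):
    """Check every .iso one level below root and (re)create its .txt if needed.

    fingerprint is any callable that takes a path and returns a string
    (a full hash, a partial hash, ...). Returns the keys of new/changed files.
    """
    # Load the database from the previous run, or start empty on the first run.
    if os.path.exists(db_path):
        with open(db_path) as handle:
            db = json.load(handle)
    else:
        db = {}

    changed = []
    for sub in sorted(os.listdir(root)):
        folder = os.path.join(root, sub)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            if not name.lower().endswith(".iso"):
                continue
            path = os.path.join(folder, name)
            # Use "/" in the key so the database looks the same on every OS.
            key = sub + "/" + name
            record = {"size": os.path.getsize(path), "hash": fingerprint(path)}
            if db.get(key) == record:
                continue  # same size and same fingerprint: treat as unchanged
            db[key] = record
            # New or changed file: (re)create the .txt next to the .iso.
            txt_path = os.path.splitext(path)[0] + ".txt"
            with open(txt_path, "w") as handle:
                handle.write("%s %d %s\n" % (name, record["size"], record["hash"]))
            changed.append(key)

    # Persist the updated database for the next run.
    with open(db_path, "w") as handle:
        json.dump(db, handle, indent=2)
    return changed
```

How cheap this is depends entirely on what `fingerprint` reads; the sketch itself does not decide that.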

Okay, since hashing the complete .iso would take too much time and timestamps are not reliable enough:

What other approaches are there to check whether a file has changed or been updated?

Notes: It has to be OS-independent and should work in Python 2.7.
I thought about just reading the first 100 blocks or something like that.
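
The "first 100 blocks" idea might look something like this. It is a sketch, not a guarantee: the function name `head_md5` and the 1 MiB block size are assumptions chosen for illustration, and a change that only touches bytes after the hashed prefix will go unnoticed unless the file size changes as well.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # assumed block size (1 MiB); tune as needed
BLOCK_COUNT = 100         # "the first 100 blocks" from the question


def head_md5(path, block_size=BLOCK_SIZE, block_count=BLOCK_COUNT):
    """Return the MD5 hex digest of at most the first block_count blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for _ in range(block_count):
            block = handle.read(block_size)
            if not block:
                break  # file is shorter than the prefix we wanted to hash
            digest.update(block)
    return digest.hexdigest()
```

For a file smaller than the prefix this is simply the MD5 of the whole file, so small files lose nothing.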

boop
  • use sets, check set.difference for new or deleted files, check size for file changes – Padraic Cunningham Jun 08 '14 at 17:31
  • File size and sets (implicit) are already mentioned in the question – boop Jun 08 '14 at 17:32
  • Have you timed how long it takes to compute the hash of one of the large files? If not, don't just assume that it would take too long. – Tim Jun 08 '14 at 17:35
  • Without making any assumptions, I don't think there's a universal way to detect file changes other than reading every byte of it. Imagine your hard disk fails and a single byte gets corrupted. – Pavel Jun 08 '14 at 17:37
  • Store the file names in a set and pickle it. Load the pickled object on the next iteration and compare pickled_set.difference(new_set). Any name changes or deleted files can be found using this logic. Sets are efficient for lookups. – Padraic Cunningham Jun 08 '14 at 17:38
  • If you only need to implement this for a specific file type, e.g. `iso`, you can check if there is a header so that you can check only a small portion of the file for changes. Although this doesn't guarantee you that there's not a different iso file with the same header but with different content. – Pavel Jun 08 '14 at 17:39
  • 1
    Anyway, consider using CRC32 instead of MD5, hash a fixed part of the file (e.g. first and last 100M) and read this: http://stackoverflow.com/questions/1177607/what-is-the-fastest-way-to-create-a-checksum-for-large-files-in-c-sharp – Pavel Jun 08 '14 at 17:41
  • @Tim yep, too long. @Pavel I like the header idea, this should be good enough actually. – boop Jun 08 '14 at 17:45
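
The CRC32 suggestion from the comments (checksum only the first and last 100M) could be sketched like this. The function name `head_tail_crc32`, the 1 MiB read buffer, and the choice to prepend the file size to the result are assumptions for illustration. `zlib.crc32` is in the standard library on Python 2.7 and 3. As noted in the comments, a change confined to the middle of a large file that keeps the size identical would not be detected.

```python
import os
import zlib


def head_tail_crc32(path, chunk=100 * 1024 * 1024, buf_size=1024 * 1024):
    """Return "size:crc32" computed over the first and last chunk bytes."""
    size = os.path.getsize(path)
    # Files shorter than two chunks: make sure no byte is read twice.
    head_end = min(chunk, size)
    tail_start = max(size - chunk, head_end)
    crc = 0
    with open(path, "rb") as handle:
        for start, end in ((0, head_end), (tail_start, size)):
            handle.seek(start)
            remaining = end - start
            while remaining > 0:
                data = handle.read(min(buf_size, remaining))
                if not data:
                    break
                crc = zlib.crc32(data, crc)  # running CRC over all bytes read
                remaining -= len(data)
    # Mask so the value is unsigned on Python 2 as well.
    return "%d:%08x" % (size, crc & 0xFFFFFFFF)
```

Including the size in the returned string means a size change alone already produces a different value, which matches the "compare size first" step in the question.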

0 Answers