I'm running a script that checks whether new files are available or existing files have changed.
root/
├── Sub1
│   ├── Sub1.iso
│   └── Sub1.txt
├── Sub2
│   ├── Sub2.iso
│   └── Sub2.txt
└── Sub3
    └── Sub3.iso
When a file is new, an item.txt will be created.
When a file has changed, the item.txt shall be recreated.
Created and modified timestamps are not reliable enough, since a file could be copied / pasted (or otherwise touched) and still be the same file.
My idea would be an MD5 hash, but the files could potentially be up to 50 GB each, and hashing them completely would take far too much time.
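For reference, the baseline would be a streaming MD5 over the whole file (constant memory even at 50 GB, but it still has to read every byte, which is exactly the cost in question); a minimal sketch:

```python
import hashlib

def full_md5(path, chunk_size=1024 * 1024):
    """Stream the whole file through MD5 so memory use stays constant.

    Correct, but reading 50 GB from disk is what makes it too slow.
    """
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()
```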
The usual workflow would be:

- loop over all subfolders of root
- compare the size and, when necessary, the hash of each .iso with an existing database entry
- create a .txt if the file is new / updated
- save / update the hash in a database (filename | hash); see the sketch after this list
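A rough sketch of that loop, assuming a JSON file stands in for the database; `walk_and_check`, `ROOT`, and `DB_PATH` are illustrative names of my own, not part of any existing setup:

```python
import json
import os

ROOT = 'root'            # illustrative path, adjust to the real root folder
DB_PATH = 'hashes.json'  # stands in for the database: path -> [size, hash]

def load_db():
    # Read the previous run's size/hash entries, if any.
    if os.path.exists(DB_PATH):
        with open(DB_PATH) as f:
            return json.load(f)
    return {}

def save_db(db):
    with open(DB_PATH, 'w') as f:
        json.dump(db, f)

def walk_and_check(hash_func):
    db = load_db()
    for dirpath, _, filenames in os.walk(ROOT):
        for name in filenames:
            if not name.lower().endswith('.iso'):
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            entry = db.get(path)
            if entry is not None and entry[0] == size:
                # Cheap size check passed; only now pay for a hash.
                digest = hash_func(path)
                if entry[1] == digest:
                    continue  # same size and hash: unchanged
            else:
                digest = hash_func(path)  # new file, or size already differs
            # New or updated: (re)create the marker .txt next to the .iso.
            open(os.path.splitext(path)[0] + '.txt', 'w').close()
            db[path] = [size, digest]
    save_db(db)
```

Here `hash_func` could be the full streaming MD5 above, or a cheaper partial hash like the one sketched at the end.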
Okay, since hashing the complete .iso would take too much time and timestamps are not reliable enough:
what other approaches are there to check whether a file has changed / been updated?
Notes: it has to be OS independent and should be viable in Python 2.7.
I thought about just reading and hashing the first 100 blocks, or something like that.
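That idea could look like the following minimal sketch; `partial_md5`, the 4096-byte block size, and the 100-block count are my assumptions, not anything fixed. Mixing the file size into the digest at least catches files that grew or shrank; a change purely beyond the sampled region would still go undetected, which is why the size comparison in the workflow above matters:

```python
import hashlib
import os

def partial_md5(path, block_size=4096, num_blocks=100):
    """Hash only the first num_blocks blocks plus the file size.

    Fast regardless of file size, but changes outside the sampled
    region are invisible to this check.
    """
    md5 = hashlib.md5()
    # Fold the size in so a grown/shrunk file changes the digest.
    md5.update(str(os.path.getsize(path)).encode('ascii'))
    with open(path, 'rb') as f:
        for _ in range(num_blocks):
            block = f.read(block_size)
            if not block:
                break
            md5.update(block)
    return md5.hexdigest()
```

Sampling a few blocks from the end as well (via `f.seek()`) would additionally catch appended data; whatever the sampling scheme, only the sampled regions are actually verified.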