I am working with a lot of very large files (e.g. 1000 x 512MB), and I implemented a way to speed things up by writing certain information into databases that can be accessed when I re-run the software. For this reason I need to be able to generate a unique filename for any arbitrary subset of these files.
I have tried generating file names from the total file size of the subset, from the file modification dates, and from combinations of the two. The problem is that many files have the same size and the same modification date, which makes my current identifier string ambiguous. The only requirement is that the same list of files always produces the same identifier, so that I can always access the correct database file for the same set of files. Any ideas are greatly appreciated!
Here is what I use at the moment, which does not work...
import os
import glob
import datetime
file_paths = glob.glob("path/to/files/*.foo")
def modification_date(file_path):
    return datetime.datetime.fromtimestamp(os.path.getmtime(file_path))
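# Collapse each modification date into a single number (different dates can collide).
uid = [modification_date(f) for f in file_paths]
uid = [d.year + d.month + d.day + d.hour + d.minute + d.second for d in uid]
# Combine the average date number with the total size of all files.
uid = sum(uid) // len(uid) + sum(os.path.getsize(f) for f in file_paths)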
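One direction that might satisfy the requirement above is to hash the sorted file paths together with each file's size and modification time, instead of summing numbers. This is only a minimal sketch under assumptions: `subset_id` is a hypothetical helper name, the files are assumed to stay at the same paths between runs, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import os


def subset_id(file_paths):
    """Return a hex digest that is the same for the same set of files.

    Hypothetical helper: the identifier depends on each file's absolute
    path, size, and modification time, not on the file contents.
    """
    digest = hashlib.sha256()
    # Sort so that the order of the input list does not change the result.
    for path in sorted(os.path.abspath(p) for p in file_paths):
        stat = os.stat(path)
        # Join the fields with NUL so adjacent fields cannot run together.
        record = "\0".join((path, str(stat.st_size), str(stat.st_mtime_ns)))
        digest.update(os.fsencode(record) + b"\0")
    return digest.hexdigest()
```

The result could then be used as a database name, e.g. `subset_id(file_paths) + ".db"`. Because the path is part of the hash, two files with identical size and modification date still contribute differently, but moving or renaming a file would change the identifier. Hashing the file contents would avoid that, although reading 1000 x 512MB of data is likely too slow for this purpose.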