
I am working with a lot of very large files (e.g. 1000 × 512 MB), and I speed things up by writing certain information into databases that can be accessed when I re-run the software. For this reason I need to be able to generate unique filenames for arbitrary subsets of these files.

I have tried to generate file names based on the total file size of the subset and the file modification dates, and even combinations of the two. The problem is that many files have the same size and the same modification date, which makes my current identifier string ambiguous. The only requirement is that for the same list of files the identifier is always the same, so that I can always access the correct database file for the same set of files. Any ideas are greatly appreciated!

Here is what I use at the moment, which does not work...

import os
import glob
import datetime

file_paths = glob.glob("path/to/files/*.foo")

def modification_date(file_path):
    return datetime.datetime.fromtimestamp(os.path.getmtime(file_path))

uid = [modification_date(f) for f in file_paths]
uid = [d.year + d.month + d.day + d.hour + d.minute + d.second for d in uid]
uid = sum(uid) // len(uid) + sum(os.path.getsize(f) for f in file_paths)
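The ambiguity is easy to reproduce: a scheme that sums a datetime's fields maps many distinct timestamps to the same number, so different file lists can produce the same uid. A small illustration with hypothetical timestamps:

```python
import datetime

def date_score(d):
    # Summing a datetime's fields, as the scheme above does, collides easily:
    # many distinct timestamps produce the same sum.
    return d.year + d.month + d.day + d.hour + d.minute + d.second

a = datetime.datetime(2016, 1, 29, 14, 12, 0)
b = datetime.datetime(2016, 1, 28, 14, 13, 0)  # a different timestamp...
print(a != b, date_score(a) == date_score(b))  # ...but the same score
```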
HansSnah
  • If all the metadata (size, modification date) are the same, you need to use the contents of the file. A simple md5 hash might do. – Selcuk Jan 29 '16 at 14:12
  • Do these files have names? Can you base your subset name on a combination of those file names? ex. The subset consisting of "foo.txt" and "bar.exe" is named `"(foo.txt)(bar.exe)"`. – Kevin Jan 29 '16 at 14:15
  • @Kevin I also thought about using the filenames, but they are quite long and I would end up with a very long unique identifier. Maybe there is a quick way to convert all filenames into a sequence of numbers...the filenames contain all sorts of letters and symbols though... – HansSnah Jan 29 '16 at 14:18
  • @Selcuk Wouldn't an md5 hash take a long time to generate for such large files? That's why I avoided it in the first place. It has to be quick! – HansSnah Jan 29 '16 at 14:19
  • Can you generate unique, non-repeating random filenames with tempfile? – biobirdman Jan 29 '16 at 14:19
  • @biobirdman I am not familiar with tempfile, but I doubt I can produce reproducible unique filenames, right? Would be great if you proved me wrong... – HansSnah Jan 29 '16 at 14:21
  • I think you're going to end up with a long unique identifier no matter what you do. There are 2^1000 different possible subsets of 1000 files, so by the pigeonhole principle at least some of them will have identifiers that are at least log(2^1000, 128) ≈ 143 characters long, assuming your filenames can contain any of the 128 ASCII characters. – Kevin Jan 29 '16 at 14:22
  • The uuid module might be what you need: http://stackoverflow.com/questions/10501247/best-way-to-generate-random-file-names-in-python – biobirdman Jan 29 '16 at 14:23
  • @biobirdman But I don't see a way to make the unique identifier reproducible with uuid...hmmm – HansSnah Jan 29 '16 at 14:28

2 Answers


This may well be wildly out, but as long as the files included in the list don't change, could you get what you are after with hashlib?

import hashlib, glob
file_paths = glob.glob("/home/rolf/StackOverflow/*.py")
hash_object = hashlib.sha256(str(file_paths).encode("utf-8"))
file_name = hash_object.hexdigest()
file_name
'f69dd9583eabae55319e9a56dbfc737cc16ab58634d9042f4c530e9a85e7d77c'

file_paths = glob.glob("/home/rolf/Research/*.py")
hash_object = hashlib.sha256(str(file_paths).encode("utf-8"))
file_name = hash_object.hexdigest()
file_name
'dc4beec8e859b190d966068a52674d135f22520514ac1020b3e11a4af090e579'
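One caveat: str(file_paths) depends on the order in which glob happens to return the paths, which is not guaranteed to be stable. Sorting the list first makes the identifier depend only on the set of files. A sketch of that variant (the function name subset_id is just illustrative):

```python
import hashlib

def subset_id(file_paths):
    # Sort so that the same set of paths always hashes to the same digest,
    # independent of the order the paths were collected in.
    joined = "\0".join(sorted(file_paths))  # NUL-join to avoid ambiguous concatenation
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

print(subset_id(["b.foo", "a.foo"]) == subset_id(["a.foo", "b.foo"]))  # True
```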
Rolf of Saxony
  • Hmmm. This works, but how can I be sure that this is unique for all potential combinations? – HansSnah Jan 29 '16 at 19:45
  • As @Michael Mullany states in the attached link, the probability of a collision depends only on the number of inputs and is 1 - (2^256)! / ((2^256)^inputcount * (2^256 - inputcount)!), basically zero for a reasonable number of inputs. http://stackoverflow.com/questions/2299846/are-there-circumstances-where-a-hash-algorithm-can-be-guaranteed-unique – Rolf of Saxony Jan 30 '16 at 10:06

If the master list of all files never changes, then you can uniquely encode any subset of that list into a string of about len(master_list)/8 bytes (slightly more once base64-encoded). First, you create a bit array where a 0 at index i means "the ith element of the master list is not part of this subset" and a 1 means "the ith element of the master list is part of this subset". Then you convert the bit array into an integer, which you can then pack into a string.

import struct
import base64

#encodes a positive integer of any size into a string.
def string_from_int(value):
    bytes = []
    while value > 0:
        bytes.append(value % 2**8)
        value >>= 8
    return struct.pack("i"*len(bytes), *bytes)

#decodes a string into a positive integer. Only works on strings whose length is divisible by 4.
def int_from_string(s):
    bytes = struct.unpack("i" * (len(s)/4), s)
    return sum(byte * (256**i) for i, byte in enumerate(bytes))

#encodes a subset of a list into a filename-safe string.
def encode(master_file_list, subset):
    bitmask = [int(filename in subset) for filename in master_file_list]
    int_value = sum(bit*2**(i) for i, bit in enumerate(bitmask))
    string_value = string_from_int(int_value)
    #the string may contain slashes or other characters that are illegal in filenames, so we base64-encode it, at the expense of increasing the length by a third or so
    return base64.urlsafe_b64encode(string_value)

#the reverse of `encode`.
def decode(master_file_list, filename):
    string_value = base64.urlsafe_b64decode(filename)
    int_value = int_from_string(string_value)
    subset = [value for i,value in enumerate(master_file_list) if (2**i) & int_value]
    return subset

master_file_list = ['caddy.jpg', 'fjeld.dat', 'gyve.ini', 'karts.png', 'laves.txt', 'nimbi.jpg', 'ocas.ini', 'sipes.dat', 'wacky.png', 'waff.png']
subset = ["fjeld.dat", "laves.txt", "ocas.ini"]
filename = encode(master_file_list, subset)
print "Subset is:", subset
print "Filename is:", filename
print "Filename decodes back to:", decode(master_file_list, filename)

Result:

Subset is: ['fjeld.dat', 'laves.txt', 'ocas.ini']
Filename is: UgAAAA==
Filename decodes back to: ['fjeld.dat', 'laves.txt', 'ocas.ini']
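On Python 3, the struct-based helpers above can be replaced by int.to_bytes / int.from_bytes, which also packs the bitmask into one byte per 8 files instead of four bytes per packed value. A sketch under the same assumption of a fixed master list:

```python
import base64

master_file_list = ['caddy.jpg', 'fjeld.dat', 'gyve.ini', 'karts.png', 'laves.txt',
                    'nimbi.jpg', 'ocas.ini', 'sipes.dat', 'wacky.png', 'waff.png']

def encode(master_file_list, subset):
    # Bit i of the integer records whether master item i is in the subset.
    members = set(subset)
    int_value = sum(1 << i for i, name in enumerate(master_file_list) if name in members)
    n_bytes = max(1, (int_value.bit_length() + 7) // 8)  # at least one byte, even for the empty subset
    return base64.urlsafe_b64encode(int_value.to_bytes(n_bytes, "little")).decode("ascii")

def decode(master_file_list, filename):
    int_value = int.from_bytes(base64.urlsafe_b64decode(filename), "little")
    return [name for i, name in enumerate(master_file_list) if int_value & (1 << i)]

subset = ["fjeld.dat", "laves.txt", "ocas.ini"]
filename = encode(master_file_list, subset)
print(filename)                            # 'Ug=='
print(decode(master_file_list, filename))  # ['fjeld.dat', 'laves.txt', 'ocas.ini']
```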
Kevin