Possible Duplicate:
How to make unique short URL with Python?
I'm looking for a way to essentially shorten a path to a file on disk to a fixed-length string, so that I can access it either via its absolute path or via this alias.
I have been looking into using UUIDs as keys for a dictionary of all paths that have an alias, but I've found them too long; I'd like the alias to be between 5 and 10 characters. I have also been looking at hashes and thought about hashing the actual path into some useful string I could use directly as an alias, then storing the values in a table on disk. I'm very fresh in the area of hashing, but as I understand it, the key could then be recovered by simply rehashing the path, and looking that key up in the table would give me the value without the table having to be fully loaded into memory or read fully from disk.
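Something along these lines is what I have in mind; a minimal sketch (make_alias and its length parameter are just throwaway names I made up):

import base64
import hashlib

def make_alias(path, length=8):
    # Hash the full path, then keep the first `length` characters of the
    # URL-safe base64 encoding as a short, fixed-length alias.
    digest = hashlib.md5(path).digest()
    return base64.urlsafe_b64encode(digest)[:length]

print make_alias("/root/folder1/folder2/folder3/file.png")  # prints an 8-char alias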
The end goal is to be able, in my custom browser, to point to the same file using both:
"/root/folder1/folder2/folder3/file.png" and e.g. "MTEzNDUy"
A possible dictionary would look something like this; note the fixed-length keys.
{"MSFjak5m": "/root/folder1/folder2/file.png",
"sofkAkfg": "/root/file.exe",
"ASg5OFA3": "/root/file2.so",
"fFAgeEGH": "/root/file5.so"}
Having a lookup table on disk is acceptable, but it would be even better if I could simply compress the path into such an alias. The best solution would be for the table to use the hash directly to look up a value, as opposed to storing key/value pairs: otherwise it seems I'd be doing one hash to get the alias, and then the dictionary would perform yet another hash on that key to find the value. Please correct me if I'm wrong.
The number of entries would be around 100 000, and all operations would preferably be kept in Python.
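As far as I understand it, a dict will indeed hash the short key a second time internally, but hashing an 8-character string should be cheap next to MD5-ing the full path, so maybe that doesn't matter much. For keeping the table on disk without loading it whole, something like shelve from the standard library looks like it would do ("aliases" is just a made-up filename):

import shelve

# Disk-backed alias table; shelve keeps the mapping in a dbm file on
# disk, so the whole table never has to sit in memory at once.
table = shelve.open("aliases")
table["MSFjak5m"] = "/root/folder1/folder2/file.png"
print table["MSFjak5m"]
table.close()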
Thanks
Edit
Performed a few tests with encoding an MD5 hash and using part of the result as a key. I found that using the first 4 characters gives a collision rate of about 1 per every 600 entries. Using the first 5 gave a collision rate of about 1 per 40 000.
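For what it's worth, those rates are close to what the birthday approximation predicts: with n entries and 64**chars possible base64 prefixes, you'd expect roughly n**2 / (2 * 64**chars) collisions. A quick check (n = 50 000 matches the folder count used in the test below):

n = 50000
for chars in (4, 5):
    # Birthday approximation: expected collisions ~ n^2 / (2 * keyspace).
    expected = n ** 2 / (2.0 * 64 ** chars)
    print "chars=%d: ~%.1f collisions per %d entries (about 1 in %.0f)" % (
        chars, expected, n, n / expected)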
These entries will be created one at a time, at a rate of about 5/day under normal operation and a maximum of 100/day during peak periods, never going above a total of 1 000 000 entries.
Considering this, I'm most likely going to check the uniqueness of each hash against what has already been stored and deal with collisions by either (A) warning the user that the path cannot be created and that another name must be chosen, or (B) increasing the number of characters taken from the hash until a unique key is found. Either of those seems acceptable at this point.
(Side note: does checking a hash against a stored hash table defeat the purpose of using a hash function?)
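Option B would look something like this; a sketch using the same base64-encoded MD5 keys as the test below (unique_alias is a made-up name). Re-registering the same path stays idempotent:

import hashlib

def unique_alias(path, table, min_chars=5):
    # Take a growing prefix of the full hash until the prefix is either
    # unused or already mapped to this very path.
    full = hashlib.md5(path).digest().encode('base64').strip()
    for length in xrange(min_chars, len(full) + 1):
        key = full[:length]
        if table.get(key, path) == path:
            table[key] = path
            return key
    raise ValueError("no unique alias found for %r" % path)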
Code for the test on Windows. Testing only against folders; I've got about 50 000 on my drive.
import hashlib
import os
from random import shuffle

def shuffle_string(word):
    # Shuffle the characters of a string (not used in this run).
    word = list(word)
    shuffle(word)
    return ''.join(word)

tests = 10
chars = 5
_entries = 0
_hashes = {}
for test in xrange(tests):
    for path, _d, _f in os.walk('c:/'):
        # Append the test number so every pass hashes a distinct string.
        unique_path = "%s%i" % (path, test)
        # Key = first `chars` characters of the base64-encoded MD5 digest.
        key = hashlib.md5(unique_path).digest().encode('base64').strip()[:chars]
        _hashes[key] = unique_path
        _entries += 1

# Every key that was overwritten in the dictionary is a collision.
total_collisions = _entries - len(_hashes)
print "Entries: %s\nTests: %s\nChars: %s" % (_entries, tests, chars)
if total_collisions:
    average_collisions = total_collisions / float(tests)
    odds = _entries / float(average_collisions)
    print "%s collisions per %s entries" % (average_collisions, _entries)
    print "odds: 1 in %s" % odds
    print "chance: %s%%" % (100 / odds)
else:
    print "No collisions occurred"