Possible Duplicate:
How to make unique short URL with Python?
I'm looking for a way to essentially shorten a path to a file on disk to a fixed-length string, so that I can access it either via its absolute path or via this alias.
I have been looking into using UUIDs as keys for a dictionary of all paths that have an alias, but I've found them too long; I'd like the alias to be between 5 and 10 characters. I have also been looking at hashes and thought about hashing the actual path into some useful string I could use directly as an alias, then storing the values in a table on disk. I'm very fresh in the area of hashing, but as I understand it, the key could then be recovered by simply rehashing the path, and looking that key up in the table would give me the value without the table having to be fully loaded into memory or read fully from disk.
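Something along these lines is what I have in mind; a minimal sketch (make_alias and its length parameter are just throwaway names I made up):

import base64
import hashlib

def make_alias(path, length=8):
    # Hash the full path, then keep the first `length` characters of the
    # URL-safe base64 encoding as a short, fixed-length alias.
    digest = hashlib.md5(path).digest()
    return base64.urlsafe_b64encode(digest)[:length]

print make_alias("/root/folder1/folder2/folder3/file.png")  # prints an 8-char alias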
The end goal is to be able, in my custom browser, to point to the same file using both:
"/root/folder1/folder2/folder3/file.png" and e.g. "MTEzNDUy"
A possible dictionary would look something like this; note the fixed-length keys.
{"MSFjak5m": "/root/folder1/folder2/file.png",
"sofkAkfg": "/root/file.exe",
"ASg5OFA3": "/root/file2.so",
"fFAgeEGH": "/root/file5.so"}
Having a lookup table on disk is acceptable, but it would be even better if I could simply compress the path into such an alias. The best solution would be for the table to use the hash directly to look up a value, as opposed to storing key/value pairs: otherwise it seems I'd be doing one hash to get the alias, and then the dictionary would perform yet another hash on that key to find the value. Please correct me if I'm wrong.
The number of entries would be around 100 000, and all operations would preferably be kept in Python.
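As far as I understand it, a dict will indeed hash the short key a second time internally, but hashing an 8-character string should be cheap next to MD5-ing the full path, so maybe that doesn't matter much. For keeping the table on disk without loading it whole, something like shelve from the standard library looks like it would do ("aliases" is just a made-up filename):

import shelve

# Disk-backed alias table; shelve keeps the mapping in a dbm file on
# disk, so the whole table never has to sit in memory at once.
table = shelve.open("aliases")
table["MSFjak5m"] = "/root/folder1/folder2/file.png"
print table["MSFjak5m"]
table.close()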
Thanks
Edit
Performed a few tests with encoding an MD5 hash and using part of the result as a key. I found that using the first 4 characters gives a collision rate of about 1 per every 600 entries. Using the first 5 gave a collision rate of about 1 per 40 000.
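For what it's worth, those rates are close to what the birthday approximation predicts: with n entries and 64**chars possible base64 prefixes, you'd expect roughly n**2 / (2 * 64**chars) collisions. A quick check (n = 50 000 matches the folder count used in the test below):

n = 50000
for chars in (4, 5):
    # Birthday approximation: expected collisions ~ n^2 / (2 * keyspace).
    expected = n ** 2 / (2.0 * 64 ** chars)
    print "chars=%d: ~%.1f collisions per %d entries (about 1 in %.0f)" % (
        chars, expected, n, n / expected)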
These entries will be created one at a time, at a rate of about 5/day under normal operation and a maximum of 100/day during peak periods, never going above a total of 1 000 000 entries.
Considering this, I'm most likely going to check the uniqueness of each hash against what has already been stored and deal with collisions by either (A) warning the user that the path cannot be created and that another name must be chosen, or (B) increasing the number of characters taken from the hash until a unique key is found. Either of those seems acceptable at this point.
(Side note: does checking a hash against a stored hash table defeat the purpose of using a hash function?)
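Option B would look something like this; a sketch using the same base64-encoded MD5 keys as the test below (unique_alias is a made-up name). Re-registering the same path stays idempotent:

import hashlib

def unique_alias(path, table, min_chars=5):
    # Take a growing prefix of the full hash until the prefix is either
    # unused or already mapped to this very path.
    full = hashlib.md5(path).digest().encode('base64').strip()
    for length in xrange(min_chars, len(full) + 1):
        key = full[:length]
        if table.get(key, path) == path:
            table[key] = path
            return key
    raise ValueError("no unique alias found for %r" % path)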
Code for the test on Windows. Testing only against folders; I've got about 50 000 on my drive.
import hashlib
import os
from random import shuffle

def shuffle_string(word):
    # Shuffle the characters of a string (not used in this run).
    word = list(word)
    shuffle(word)
    return ''.join(word)

tests = 10
chars = 5
_entries = 0
_hashes = {}
for test in xrange(tests):
    for path, _d, _f in os.walk('c:/'):
        # Append the test number so every pass hashes a distinct string.
        unique_path = "%s%i" % (path, test)
        # Key = first `chars` characters of the base64-encoded MD5 digest.
        key = hashlib.md5(unique_path).digest().encode('base64').strip()[:chars]
        _hashes[key] = unique_path
        _entries += 1

# Every key that was overwritten in the dictionary is a collision.
total_collisions = _entries - len(_hashes)
print "Entries: %s\nTests: %s\nChars: %s" % (_entries, tests, chars)
if total_collisions:
    average_collisions = total_collisions / float(tests)
    odds = _entries / float(average_collisions)
    print "%s collisions per %s entries" % (average_collisions, _entries)
    print "odds: 1 in %s" % odds
    print "chance: %s%%" % (100 / odds)
else:
    print "No collisions occurred"