0

I want a collision free hash value that is as short as possible. I want to use it as a pretty directory path to a file name.

  • I want to build a directory tree that has an almost equal length path to any file in it.
  • The files have binary content.
  • Two files of identical content should produce identical file paths (I think that is what the hash should provide).
  • Hash length should be minimal.
  • Hash calculation time is NOT the top priority since the hash
    calculation is done once for each file.

My current solution:

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.digest.DigestUtils;

String shortHash(){
  byte[] content = "sample".getBytes();
  byte[] hex = DigestUtils.md5(content);
  String filename = Base64.encodeBase64URLSafeString(hex);
  return filename;
}

It produces the hash value 5e8ff9bf55ba3508199d22e984129be6 and a file name as Xo_5v1W6NQgZnSLphBKb5g

To store many files in a directory tree, I simply split the file name to produce a file path like this:

<basedir>/Xo/_5/v1W6NQgZnSLphBKb5g

How can I produce a shorter file path?

  • Well, for a start, instead of transforming the MD5 digested bytes to hex, then hex to base64, you could simply transform bytes to base64. That would make the result twice as short. – JB Nizet Sep 02 '15 at 22:12
  • Please define *exactly* what you mean by "collision free" here. – Jon Skeet Sep 02 '15 at 22:16
  • @JBNizet Well, that's fine. I'll edit my solution. – Fred Felsbruckner Sep 02 '15 at 22:26
  • @JonSkeet I'm not aware of more conditions for the term "collision free". I want an 1:1 association from file name to content of the file. No or minimal duplicate files with the same content should exist. The use case is a file store where any file dropped produces a distinct file path. – Fred Felsbruckner Sep 02 '15 at 22:36
  • So *either* you can't make the hash purely based on the content of the file - it would have to take account of existing files, *or* the hash must be able to be at least as long as the file (however large that is), by the pigeon-hole principle. In particular, your current solution is going to fail, as any fixed-size hash such as MD5 is bound to have collisions. – Jon Skeet Sep 02 '15 at 23:32
  • Of course, you could choose a hash which is *extremely unlikely* to have collisions, but that's not the same as collision-free. – Jon Skeet Sep 03 '15 at 05:47
  • @JonSkeet Sorting things out is a good strategy to narrow possible solutions. Why is a [hashcode](https://en.wikipedia.org/wiki/Hash_code) not a solution? I thought hashes are made for mapping data of arbitrary size to fixed sized data. Could you elaborate a bit more on your statement or guide me to the concepts I need to understand? May I have to be not so strict in the sense of _collision free_? – Fred Felsbruckner Sep 03 '15 at 17:28
  • Hash codes are used for two distinct reasons: a) in hash tables and the like, to narrow down candidate matches for equality (typically of keys) very quickly, where collisions are *somewhat* rare but thoroughly expected due to the small hash code size (typically 32 bits) and b) in cryptography, where hash collisions are expected to be *extremely* rare, but are still not impossible. Cryptographic hashes are designed to make it incredibly difficult to deliberately achieve a collision. But fundamentally you *can't* map an arbitrary size of data down to a fixed size and guarantee no collisions. – Jon Skeet Sep 03 '15 at 18:03
  • If you want it to be somewhat less strict, e.g. "there is a miniscule chance of a hash collision" then pick a decent cryptographic hash (SHA-256 springs to mind) and use that. That's not "collision free" in the sense that there certainly will be *possible* collisions - it's just incredibly unlikely that you'll actually see them. – Jon Skeet Sep 03 '15 at 18:04
  • @JonSkeet Thanks for clarification. So I have to accept collisions, even if extremely rare. That means two keys can produce the same hash code. I need a strategy on how to handle this case. But that is another question. – Fred Felsbruckner Sep 03 '15 at 19:03

2 Answers2

1

I want a collision free hash value

A hash is never collision free, but you can choose a hash algorith which is extremely unlikely to have collisions, as Jon Skeet explained.

How can I produce a shorter file path?

You need to distiguish two responsibilities.

  1. The hash value for the file gives you fast collision detection.
  2. Produce a short file path. Take a look at alphabet conversion theory and Java UrlShortener implementation.

To handle #2 you follow these steps:

a)Convert

  1. Save real file path in database
  2. You get a unique row ID for that
  3. Convert row ID to short string with encode()
  4. Use the short string as your short file path

b)Resolve

  1. Decode the short file path to an integer ID with decode()
  2. Look up real file path in database for given ID
Community
  • 1
  • 1
JanPl
  • 794
  • 6
  • 19
-1

You can use a Cyclic redundancy check, which generates a value based on the bytes of your content. This is the one I use in Java, which returns a long:

public static long crc64(byte[] data) {
    long crc = 0xffffffffffffffffL;
    for (int b : data) {
        int b2 = (int) (((crc >> 56) & 0xFF) ^ (b & 0xFF));
        crc = (crc << 8) & 0xffffffffffffffffL ^ CRC64_Table[b2];
    }
    return crc;
}

The CRC64_Table is too big to post it here, so I've uploaded it to pastebin.

EDIT: You can also use a 32 bit version, like this one: http://introcs.cs.princeton.edu/java/51data/CRC32.java.html

eric.m
  • 1,577
  • 10
  • 20
  • That's not going to be collision-free. (Nor is any other hash of an arbitrary quantity of data... I don't think the OP's question really makes sense - but a CRC isn't an answer to it.) – Jon Skeet Sep 03 '15 at 05:46
  • @Emd4600 Thanks for your suggestion. I'm sorry to say that it doesn't help with my goal. I do NOT need code for common algorithms, I can research that myself. But I would like to know which concepts would help me to find a solution. – Fred Felsbruckner Sep 03 '15 at 17:18