
I'm looking for a hash function to hash Strings. For my purposes (identifying changed objects during an import) it should have the following properties:

  1. fast

  2. can be used incrementally, i.e. I can use it like this:

    Hasher h = new Hasher();
    h.add("somestring");
    h.add("another part");
    h.add("eveno more");
    Long hash = h.create();
    

    without compromising the other properties or keeping the strings in memory during the complete process.

  3. Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be negligible.

It does not have to be secure against malicious attempts to create collisions.

What algorithm can I use? An algorithm with an existing free implementation in Java is preferred.

Clarification

  1. The hash doesn't have to be a long. A String for example would be just fine.

  2. The data to be hashed will come from a file or a database, with sizes from many tens of MB up to a few GB, and it will be distributed across different hashes. So keeping the complete Strings in memory is not really an option (see the sketch below).
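For illustration, a chunked read along the following lines would satisfy that constraint; the Hasher type is the hypothetical API from the snippet above, and "data.txt" is a stand-in for the actual input:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class StreamingHashExample {
        public static void main(String[] args) throws IOException {
            Hasher h = new Hasher(); // hypothetical incremental hasher from the snippet above
            char[] buffer = new char[8192];
            try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
                int read;
                while ((read = reader.read(buffer)) != -1) {
                    // feed one chunk at a time; the complete file never sits in memory
                    h.add(new String(buffer, 0, read));
                }
            }
            System.out.println(h.create());
        }
    }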

Jens Schauder
    See [what-is-a-good-64bit-hash-function-in-java-for-textual-strings](http://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings). – NameSpace Nov 14 '14 at 11:19
  • "Secure against collisions" - then hashing is not what you're looking for. – Durandal Nov 14 '14 at 16:52
  • @Durandal Mind explaining why? – Jens Schauder Nov 15 '14 at 04:06
  • I guess @Durandal wanted to point out that every hash function has collisions. I guess what you want is a function where collisions are unlikely to happen for similar inputs. – TheConstructor Nov 18 '14 at 17:16
  • Let's say you have another 10000 days. That's 10^10 pairs of hash values required to have a negligible probability of being equal - say, less than one ppm - easy with an image space large compared to 10^16, like 63 or even 64 bits. But adding a million hash codes a day and requiring each to be unique would limit your life expectancy to less than twelve years. – greybeard Nov 18 '14 at 18:22
  • What's there to explain? Hashing to decrease the amount of information to a smaller set of values will naturally always produce collisions, which is in direct conflict with the explicit requirement quoted. As for the unrelated (not quoted) relaxation of the requirement... it depends on what "negligible" means; without a solid analysis (of inputs and method) it's impossible to determine even the order of magnitude of the collision probability. – Durandal Nov 18 '14 at 18:44
  • @Durandal the question is: how likely is a collision? Git is using this approach, and the chances that any of us will see an accidental collision seem to be extraordinarily slim. – Jens Schauder Nov 19 '14 at 06:45
  • @JensSchauder That really depends on the use case and method; just "assuming" an N-bit hash gives me a 1 in 2^N collision chance is overoptimistic in many regards. Play with the code I posted, it should provide some insight/points to keep in mind. – Durandal Nov 19 '14 at 16:08

2 Answers


Hashes are a sensitive topic and it is hard to recommend any particular hash based on your question alone. You might want to ask this question on https://security.stackexchange.com/ to get expert opinions on the suitability of hashes for specific use cases.

What I understood so far is that most hashes are implemented incrementally at their very core; their execution time, on the other hand, is not that easy to predict.

I present two Hasher implementations which rely on "an existing free implementation in Java". The ShaHasher (like the Java64Hasher shown further down) lets you split your Strings arbitrarily before calling add() and still get the same result, as long as the order of the characters is unchanged; the simpler JavaHasher mixes in each added String's hashCode() as a whole, so there the split pattern does affect the result:

import java.math.BigInteger;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/**
 * Created for https://stackoverflow.com/q/26928529/1266906.
 */
public class Hashs {

    public static class JavaHasher {
        private int hashCode;

        public JavaHasher() {
            hashCode = 0;
        }

        public void add(String value) {
            // mix in the whole String's hashCode, using the same 31*h + x scheme as String.hashCode()
            hashCode = 31 * hashCode + value.hashCode();
        }

        public int create() {
            return hashCode;
        }
    }

    public static class ShaHasher {
        public static final Charset UTF_8 = Charset.forName("UTF-8");
        private final MessageDigest messageDigest;

        public ShaHasher() throws NoSuchAlgorithmException {
            messageDigest = MessageDigest.getInstance("SHA-256");
        }

        public void add(String value) {
            // incremental update: the digest state advances, the bytes themselves are not retained
            messageDigest.update(value.getBytes(UTF_8));
        }

        public byte[] create() {
            return messageDigest.digest();
        }
    }

    public static void main(String[] args) {
        javaHash();

        try {
            shaHash();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();  // unreachable in practice: every Java platform must support SHA-256
        }
    }

    private static void javaHash() {
        JavaHasher h = new JavaHasher();
        h.add("somestring");
        h.add("another part");
        h.add("eveno more");
        int hash = h.create();
        System.out.println(hash);
    }

    private static void shaHash() throws NoSuchAlgorithmException {
        ShaHasher h = new ShaHasher();
        h.add("somestring");
        h.add("another part");
        h.add("eveno more");
        byte[] hash = h.create();
        System.out.println(Arrays.toString(hash));
        System.out.println(new BigInteger(1, hash));
    }
}

Here, obviously, "SHA-256" could be replaced with other common hash algorithms; Java ships quite a few of them.
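If you want to see what your JVM actually offers, you can query the installed security providers; a minimal sketch using the standard java.security.Security API:

    import java.security.Security;

    public class AvailableDigests {
        public static void main(String[] args) {
            // list every MessageDigest algorithm the installed providers offer
            for (String algorithm : Security.getAlgorithms("MessageDigest")) {
                System.out.println(algorithm);
            }
        }
    }

Every conforming Java platform is required to support at least MD5, SHA-1 and SHA-256.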

Now, you asked for a Long as the return value, which implies you are looking for a 64-bit hash. If that was intentional, have a look at the answers to What is a good 64bit hash function in Java for textual strings?. The accepted answer is a slight variant of the JavaHasher, as String.hashCode() does basically the same calculation, just in 32 bits and therefore with a lower overflow boundary:

    public static class Java64Hasher {
        private long hashCode;

        public Java64Hasher() {
            hashCode = 1125899906842597L; // seed constant from the linked answer (stated there to be prime)
        }

        public void add(CharSequence value) {
            final int len = value.length();

            for(int i = 0; i < len; i++) {
                hashCode = 31*hashCode + value.charAt(i);
            }
        }

        public long create() {
            return hashCode;
        }
    }
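A usage sketch, analogous to the javaHash() example above (assuming Java64Hasher is in scope, e.g. as another nested class of Hashs):

    Java64Hasher h = new Java64Hasher();
    h.add("somestring");
    h.add("another part");
    h.add("even more");
    long hash = h.create(); // 64-bit result, matching the Long from the question
    System.out.println(hash);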

On to your points:

  1. fast

    Although SHA-256 is slower than the other two, I would still call all three presented approaches fast.

  2. can be used incrementally without compromising the other properties or keeping the strings in memory during the complete process.

    I cannot fully guarantee that property for the ShaHasher; as I understand it, SHA-256 is block-based, and I lack the source code. Still, I would expect that at most one block, the hash, and some internal state are kept. The other two obviously only store the partial hash between calls to add() (see the sketch after this list).

  3. Secure against collisions. If I compare two hash values from different strings 1 million times per day for the rest of my life, the risk that I get a collision should be negligible.

    Every hash function has collisions. Given a good distribution, the bit-size of the hash is the main factor in how often a collision happens. The JavaHasher is used in e.g. HashMaps and seems to be "collision-free" enough to spread similar keys far apart from each other. As for any deeper analysis: do your own tests or ask your local security engineer - sorry.
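To make point 2 concrete, here is a small check that different splits of the same character sequence produce identical results for the ShaHasher and the Java64Hasher; it assumes both are nested classes of the Hashs class above:

    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public class SplitCheck {
        public static void main(String[] args) throws NoSuchAlgorithmException {
            // SHA-256 consumes bytes incrementally via MessageDigest.update(),
            // so any split of the same input yields the same digest
            Hashs.ShaHasher one = new Hashs.ShaHasher();
            one.add("somestringanother part");
            Hashs.ShaHasher two = new Hashs.ShaHasher();
            two.add("somestring");
            two.add("another part");
            System.out.println(Arrays.equals(one.create(), two.create())); // true

            // Java64Hasher walks the input char by char, so the same holds
            Hashs.Java64Hasher a = new Hashs.Java64Hasher();
            a.add("somestringanother part");
            Hashs.Java64Hasher b = new Hashs.Java64Hasher();
            b.add("somestring");
            b.add("another part");
            System.out.println(a.create() == b.create()); // true
        }
    }

Note that the plain JavaHasher does not have this property: it mixes in each added String's hashCode() as a unit, so the split pattern changes the result.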

I hope this gives a good starting point; the details are probably mainly opinion-based.

TheConstructor
  • [Looking at the open JDK sources](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/sun/security/provider/SHA2.java?av=h#SHA2.SHA256), SHA-256 is really implemented to hold at most one block-size of data at any time. – TheConstructor Nov 25 '14 at 12:20

Not intended as an answer, just to demonstrate that hash collisions are much more likely than human intuition tends to assume.

The following tiny program generates 2^31 distinct strings and checks whether any of their hashes collide. It does this by keeping one tracking bit per possible hash value (so it needs more than 512 MB of heap to run), marking each hash value as "used" as it is encountered. It takes several minutes to complete.

public class TestStringHashCollisions {

    public static void main(String[] argv) {
        long collisions = 0;
        long testcount = 0;
        StringBuilder b = new StringBuilder(64);
        for (int i = 0; i >= 0; ++i) { // 2^31 iterations; terminates when i overflows to negative
            // construct distinct string
            b.setLength(0);
            b.append("www.");
            b.append(Integer.toBinaryString(i));
            b.append(".com");

            // check for hash collision
            String s = b.toString();
            ++testcount;
            if (isColliding(s.hashCode()))
                ++collisions;

            // progress printing
            if ((i & 0xFFFFFF) == 0) {
                System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
            }
        }
        System.out.println("Tested: " + testcount + ", Collisions: " + collisions);
        System.out.println("Collision ratio: " + (collisions / (double) testcount));
    }

    // storage for 2^32 bits in 2^27 ints
    static int[] bitSet = new int[1 << 27];

    // test if hash code has appeared before, mark hash as "used"
    static boolean isColliding(int hash) {
        int index = hash >>> 5;
        int bitMask = 1 << (hash & 31);
        if ((bitSet[index] & bitMask) != 0)
            return true;
        bitSet[index] |= bitMask;
        return false;
    }

}

You can adjust the string generation part easily to test different patterns.
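For example, the string-construction part inside the loop could be swapped for a different (hypothetical) pattern such as:

    // e.g. email-like strings instead of host-name-like ones
    b.setLength(0);
    b.append("user").append(Integer.toHexString(i)).append("@example.com");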

Durandal