1

I have a Java application which works with MySQL database.

I want to be able to store long texts and check whether table contains them. For this I want to use index, and search by reduced "hash" of full_text.

MY_TABLE [
    full_text: TEXT
    text_hash: varchar(255) - indexed
]

Thing is, I cannot use String.hashCode() as:

  1. Implementation may vary across JVM versions.
  2. Value is too short, which means many collisions.

I want to find a fast hashing function that will read the long text value and produce a long hash value for it, say 64 symbols long.

snowindy
  • 3,117
  • 9
  • 40
  • 54

2 Answers2

4

Such reliable hash methods are not fast. They're probably fast enough, though. You're looking for a cryptographic message digest method (like the ones used to identify files in P2P networks or commits in Git). Look for the MessageDigest class, and pick your algorithm (SHA1, MD5, SHA256, etc.).

Such a hash function will take bytes as argument, and produce bytes as a result, so make sure to convert your strings using a constant encoding (UTF8, for example), and to transform the produced byte array (typically of 16 or 20 bytes) to a readable String using hexadecimal or Base64 encoding.

JB Nizet
  • 678,734
  • 91
  • 1,224
  • 1,255
1

I'd suggest that you to revisit String.hashCode().

First, it does not vary across implementations. The exact hash is specified; see the String.hashCode javadoc specification.

Second, while the String hash algorithm isn't the best there possibly is (and certainly it will have more collisions than a cryptographic hash) it does do a reasonably good job of spreading the hashes over the 32-bit result space. For example, I did a quick check of a text file on my machine (/usr/share/dict/web2a) which has 235,880 words, and there were six collisions.

Third and fourth: String.hashCode() should be considerably faster, and the storage required for the hash values should be considerably smaller, than a cryptographic hash.

If you're storing strings in a database table, and their hash values are indexed, having a few collisions shouldn't matter. Looking up a string should get you the right database rows really quickly, and having to (maybe) check a couple actual strings should be very fast compared to the database I/O.

Stuart Marks
  • 127,867
  • 37
  • 205
  • 259