10

What kind of hashing algorithm is used in the built-in HASH() function?

Ideally I'm looking for a SHA512/SHA256 hash, similar to what the SHA() function offers in the LinkedIn DataFu UDFs for Pig.

Cœur
user1152532
  • You can tell a lot by the return type. Since the HASH() function returns a (32-bit) INT type, you can safely assume it's not SHA512 or SHA256, since those would have 512-bit or 256-bit return types, respectively. – Ian McLaird Jan 17 '14 at 03:30

2 Answers

22

The HASH function (as of Hive 0.11) uses an algorithm similar to java.util.List#hashCode.

Its code looks like this:

int hashCode = 0; // Hive HASH uses 0 as the seed, List#hashCode uses 1. I don't know why.
for (Object item: items) {
   hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
}
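To make the seed difference concrete, here is a minimal runnable sketch of the loop above next to Java's own List#hashCode. The class and method names are hypothetical, and item.hashCode() only stands in for whatever per-value hash Hive applies to its own types, so the numbers illustrate the combining step and are not guaranteed to match what HASH() returns in Hive.

```java
import java.util.Arrays;
import java.util.List;

public class HiveHashSketch {
    // Same combining loop as above, seeded with 0 instead of 1.
    // Assumption: Java's hashCode() is used for each item purely for illustration.
    static int hiveStyleHash(List<?> items) {
        int hashCode = 0;
        for (Object item : items) {
            hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
        }
        return hashCode;
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("a", "b");
        System.out.println(hiveStyleHash(row)); // 3105 = (0 * 31 + 97) * 31 + 98
        System.out.println(row.hashCode());     // 4066 = (1 * 31 + 97) * 31 + 98
    }
}
```

Either way the result is a 32-bit int, which is why it cannot be a SHA-style digest.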

Basically it's a classic hash algorithm as recommended in the book Effective Java. To quote a great man (and a great book):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
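The shift-and-subtract identity from the quote can be checked directly. The sketch below (class and method names are hypothetical) compares both forms on a few values, including ones that overflow; since Java int arithmetic wraps modulo 2^32, the two expressions should agree for every int.

```java
public class ThirtyOneShift {
    // Multiply by 31 using a shift and a subtraction: (i << 5) - i.
    static int timesThirtyOne(int i) {
        return (i << 5) - i;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -7, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            // Prints "true" for each sample, even when 31 * i overflows.
            System.out.println(i + ": " + (31 * i == timesThirtyOne(i)));
        }
    }
}
```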

I digress. You can look at the HASH source here.

If you want to use SHAxxx in Hive, you can call the Apache Commons Codec DigestUtils class through Hive's built-in reflect function (I hope that'll work):

SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'your_string')
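To sanity-check what the query above returns, the same digest can be computed outside Hive. This is a sketch using only the JDK's java.security.MessageDigest, not DigestUtils itself; it assumes the input is encoded as UTF-8 and the digest is rendered as lowercase hex, which is how sha256Hex is expected to behave. The class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256HexSketch {
    // SHA-256 of the UTF-8 bytes of the input, as a 64-character lowercase hex string.
    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Standard test vector for "abc":
        // ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
        System.out.println(sha256Hex("abc"));
    }
}
```

Unlike HASH(), the result is a 256-bit digest rendered as a string, not an INT.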
Nigel Tufnel
  • Does it convert to a string for a given int type? – user145610 May 12 '16 at 14:21
  • I was confused about this: when there are only two elements and they are equal, HashCode = (0 * 31 + eleRawHash) * 31 + eleRawHash = 32 * eleRawHash = eleRawHash << 5? – leo Apr 04 '19 at 06:48
1

As of Hive 2.1.0 there is a mask_hash function that will hash string values.

In Hive 2.x it uses MD5 as the hashing algorithm; this was changed to SHA-256 in Hive 3.x.

sworisbreathing