I'm working with Apache Hive and need to know exactly how the built-in hash function works. The documentation page I found lists hash under the Misc. Functions section and says that it has been available "As of Hive 0.4".

I would just like to see some documentation on what it's doing exactly. Is it deterministic? Will it always produce the same output given the same input? How many collisions should I expect?

matthiasdenu

1 Answer

A hash function is deterministic by definition, cf. https://en.wikipedia.org/wiki/Hash_function#Determinism
So if the implementation of hash() were not deterministic, that would be a bug, and someone would have noticed by now!

Caveat: the implementation is subject to change (and to bug fixes), so determinism holds only within a given version of Hive.


Hive is open source. Its documentation is not bad by Apache standards, but it is still incomplete, so just inspect the source code: https://github.com/apache/hive

For Hive 2.1 for example:

  • the hash() function (a UDF in Hive jargon) is defined here
  • it just calls ObjectInspectorUtils.getBucketHashCode(), which calls ObjectInspectorUtils.hashCode() on each argument and then merges that argument's hash into a global "bucket" hash, as defined here
  • a comment shows that the (crude) hashing method implemented by Hive is derived from String.hashCode()
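
To make the description above concrete, here is a minimal Java sketch of a String.hashCode()-style hash and a per-argument merge step. It is not Hive's actual code: the class and method names are hypothetical, and the multiplier 31, the byte-wise iteration over UTF-8, and the merge formula are assumptions based on how String.hashCode() works. Check the source of your Hive version before relying on any specific value.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: HiveHashSketch and its methods are hypothetical
// names, not part of Hive's API.
final class HiveHashSketch {

    // String.hashCode()-style polynomial hash over the UTF-8 bytes of s
    // (assumption: byte-wise iteration, multiplier 31).
    static int hashString(String s) {
        int h = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b; // int overflow wraps around, as in String.hashCode()
        }
        return h;
    }

    // Merge per-argument hashes into one combined hash
    // (assumption: same multiplier 31 as above).
    static int combine(int... fieldHashes) {
        int h = 0;
        for (int f : fieldHashes) {
            h = 31 * h + f;
        }
        return h;
    }

    public static void main(String[] args) {
        // All three comparisons print true.
        System.out.println(hashString("abc") == "abc".hashCode()); // ASCII input matches String.hashCode()
        System.out.println(hashString("abc") == hashString("abc")); // same input, same output
        System.out.println(hashString("Aa") == hashString("BB"));   // two distinct inputs that collide
        System.out.println(combine(hashString("a"), hashString("b")));
    }
}
```

In this sketch the output is a 32-bit int and short strings such as "Aa" and "BB" already collide, so a hash of this kind should not be treated as collision-free. Whether Hive's hash() returns exactly these values depends on the version you run.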


For alternative hashing functions in Hive, see "Calculate hash without using exisiting hash fuction in Hive", although the answer there basically points to the same documentation page that you already found.
Samson Scharfrichter