I want to calculate a hash for strings in Hive without writing any UDF, using only existing functions, so that I can use a similar approach to get a consistent hash in other languages. For example, are there any functions with which I can do something like adding characters or taking an XOR?
-
Your title says *"without using exisiting hash fuction"* but your question says *"only using exisiting functions"* which is the exact opposite. What do you want, actually?? – Samson Scharfrichter Feb 09 '17 at 15:27
-
You'll have to be more specific regarding the Hive version you are using and the other languages you are referring to – David דודו Markovitz Feb 09 '17 at 20:54
-
"So that I can use similar approach to get consistent hash in other languages": if I use the existing hash function, the result won't match when I calculate it in some other language. So I want to calculate the simplest possible hash using "other" existing functions, which I will also be able to replicate in other languages. For example, I want to bucketize strings, so I can do ASCII("abc") % NoofBuckets; that gives me the ASCII code of the first character, but the distribution across buckets won't be that good. So I need something more reasonable than that. – Amit Kumar Feb 10 '17 at 07:14
1 Answer
It depends on the version of Hive, cf. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions
select XYZ, hash(XYZ) from ABC
has been available for years and applies plain old java.lang.String.hashCode(), returning an INT (32 bit hash)
[Edit 2] Actually it's a bit more complex since hash() accepts a list of arguments of any type (incl. primitive types that have no built-in hashing method), so a custom approach is used -- check ObjectInspectorUtils.hashCode() and ObjectInspectorUtils.getBucketHashCode() in the source code here (for V2.1)
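Since the goal is to reproduce the value outside Hive, here is a minimal Python sketch of the java.lang.String.hashCode() recurrence (h = 31*h + c, wrapped to a signed 32-bit int). The names java_string_hash and bucket are made up for illustration. It assumes ASCII-only input and that hash() on a single string column gives the same number; I have not checked that against every Hive version, so compare a few values from your own cluster first.

```python
def java_string_hash(s: str) -> int:
    """31-based polynomial hash, wrapped to a signed 32-bit int like Java."""
    h = 0
    # Assumption: ASCII-only input; encode() raises UnicodeEncodeError otherwise.
    for b in s.encode("ascii"):
        h = (31 * h + b) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret the unsigned 32-bit value as a signed integer.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket(s: str, num_buckets: int) -> int:
    """Non-negative bucket index, analogous to pmod(hash(s), num_buckets) in Hive."""
    # Python's % with a positive modulus never returns a negative value.
    return java_string_hash(s) % num_buckets


print(java_string_hash("abc"))  # 96354, same as "abc".hashCode() in Java
print(bucket("abc", 16))
```

On the Hive side, note that a plain hash(XYZ) % N can be negative for negative hashes, so pmod(hash(XYZ), N) is the closer match to the bucket() helper above.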
select XYZ, crc32(XYZ) from ABC
requires Hive 1.3 and applies plain old Cyclic Redundancy Check (probably via java.util.zip.CRC32), returning a BIGINT (32 bit hash)
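For the cross-language part, CRC-32 is probably the easiest of these to replicate. A short Python sketch, assuming Hive's crc32() is the standard CRC-32 (as in java.util.zip.CRC32) computed over the UTF-8 bytes of the string; crc32_of is a made-up helper name.

```python
import zlib


def crc32_of(s: str) -> int:
    """Standard CRC-32 of the UTF-8 bytes, as an unsigned 32-bit integer."""
    # Assumption: Hive hashes the UTF-8 encoding of the string.
    return zlib.crc32(s.encode("utf-8"))


def crc32_bucket(s: str, num_buckets: int) -> int:
    """Bucket index from the CRC; the value is never negative."""
    return crc32_of(s) % num_buckets


print(crc32_of("abc"))
print(crc32_bucket("abc", 16))
```

Because the result is unsigned, crc32(XYZ) % N in Hive should need no pmod() to stay non-negative.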
select XYZ, md5(XYZ), sha1(XYZ), sha2(XYZ,256), sha2(XYZ,512) from ABC
requires Hive 1.3 and applies strong, cryptographic hash functions, returning a STRING with the hexadecimal representation of the binary (128, 160, 256 and 512 bit hashes)
[Edit 1] the answer to that post has also a very good workaround for applying crypto hash functions with older versions of Hive, using Apache Commons static methods and reflect().
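The crypto digests above are standardized, so any language with a hashing library should give the same hex string. A Python sketch with hashlib, assuming Hive hashes the UTF-8 bytes of the string and returns lowercase hex (worth confirming on one value); hex_digests is a made-up helper name.

```python
import hashlib


def hex_digests(s: str) -> dict:
    """Hex digests matching md5(), sha1(), sha2(x,256) and sha2(x,512)."""
    data = s.encode("utf-8")  # assumption: Hive hashes the UTF-8 bytes
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha2_256": hashlib.sha256(data).hexdigest(),
        "sha2_512": hashlib.sha512(data).hexdigest(),
    }


for name, digest in hex_digests("abc").items():
    print(name, digest)
```

To bucketize from a digest, one option is to take a few leading hex characters and convert them to an integer, e.g. int(digest[:8], 16) % N in Python.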
