I have two jobs that do exactly the same thing: one runs in Hive and the other in Spark. The only difference in the results is one column that holds a hashed string: Hive and Spark return different values when calling hash() on the same input.
I understand that the two engines use different hashing implementations, but I was wondering: is it possible (if at all) to configure Spark to produce the same results as Hive?
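While digging around, I noticed that Spark seems to ship an internal catalyst expression named HiveHash (apparently used for Hive bucketing compatibility), though it isn't exposed as a SQL function. On Spark 2.x/3.x, something along these lines appears to compile, but it leans on private APIs and I'm not sure it's a supported approach:

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.expressions.HiveHash
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("hive-hash-test").getOrCreate()
import spark.implicits._

val df = Seq("abc", "hello").toDF("s")

// Wrap the internal HiveHash expression in a Column. This is not part of
// the public API, so it may break between Spark versions.
df.select(col("s"), new Column(HiveHash(Seq(col("s").expr))).as("hive_hash"))
  .show()
```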
Is it possible to figure out which hashing function each engine uses (e.g. Murmur3) and then use the same one in both?
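For reference, Spark's built-in hash() is documented as Murmur3-based (with seed 42), while Hive's hash() appears to fold bytes Java-hashCode-style, so the same literal hashes differently in each engine:

```scala
// In spark-shell: Spark's hash() is Murmur3-based (seed 42), so this
// returns a Murmur3-style value.
spark.sql("SELECT hash('abc')").show()

// In the Hive CLI, SELECT hash('abc') should return 96354, i.e. Java's
// "abc".hashCode, assuming Hive's string hash matches Java's hashCode
// for ASCII input.
```

(Note: from what I can tell, Spark's Murmur3 variant handles trailing bytes differently from the reference implementation, so reproducing it from a stock Murmur3 library on the Hive side may not be straightforward either.)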
Or perhaps it's possible to create a Spark UDF that reproduces the result of Hive's hash() function?
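For example, something along these lines, assuming Hive's hash() for strings folds the UTF-8 bytes as r = r * 31 + b (which should match Java's String.hashCode for ASCII input; I haven't verified this across Hive versions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("hive-hash-udf").getOrCreate()
import spark.implicits._

// Assumed re-implementation of Hive's hash() for strings: fold the UTF-8
// bytes with r = r * 31 + b. For NULL, Hive's hash() reportedly returns 0.
def hiveStringHash(s: String): Int =
  if (s == null) 0
  else s.getBytes("UTF-8").foldLeft(0)((r, b) => r * 31 + b)

val hiveHashUdf = udf(hiveStringHash _)
spark.udf.register("hive_string_hash", hiveStringHash _) // for Spark SQL use

val df = Seq("abc", "hello").toDF("s")
df.select(col("s"), hiveHashUdf(col("s")).as("hashed")).show()
// Expect hiveStringHash("abc") == 96354, the same as "abc".hashCode in Java.
```

Would that be a reasonable way to keep the two jobs in sync, or is there a cleaner approach?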