
I have two jobs that do exactly the same thing; one is in Hive and the other in Spark. The only difference in the results is a column containing a hashed string: the results differ between Hive and Spark when calling hash().

I do understand that different libraries are used, but I was wondering: how (if possible) could Spark be configured to produce the same results as Hive?

Is it possible to figure out the hashing function (e.g. Murmur3) and use it in both engines?

Perhaps it's possible to create a Spark UDF that produces the same result as the Hive hash() function?

Lou_Ds

2 Answers


I have the same problem. What I could find is that hash() in Hive uses a Java function:

Reproduce hive hash function in Python
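
For reference, here is a small Python sketch of that algorithm. It assumes (based on the linked question) that Hive's hash() for strings behaves like Java's String.hashCode computed over the UTF-8 bytes; the function name hive_hash is hypothetical:

def hive_hash(s):
    # Java-style string hash: h = 31*h + b over the UTF-8 bytes,
    # with bytes treated as signed and the result kept in 32 bits
    h = 0
    for b in s.encode('utf-8'):
        b_signed = b if b < 128 else b - 256
        h = (31 * h + b_signed) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit integer, as Java would return it
    return h - 0x100000000 if h >= 0x80000000 else h

For ASCII input this matches Java's String.hashCode, e.g. hive_hash('ABC') == 64578.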

On the other hand, this is the implementation of the hash function in Spark (PySpark):

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def hash(*cols):
    """Calculates the hash code of given columns, and returns the result as an int column.

    >>> spark.createDataFrame([('ABC',)], ['a']).select(hash('a').alias('hash')).collect()
    [Row(hash=-757602832)]
    """
    # Delegate to the JVM-side org.apache.spark.sql.functions.hash
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.hash(_to_seq(sc, cols, _to_java_column))
    return Column(jc)

However, the Spark implementation also ends up relying on JVM hash code machinery. The problem with hashCode in general is that it is not guaranteed to be deterministic: it can depend on the JVM and the system where it runs. For this reason, even if each implementation is correct on its own, the same string hashed in Hive and in Spark can give different results.

Java, Object.hashCode() result constant across all JVMs/Systems?
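
To directly address the original question: one could wrap such a function in a Spark UDF so that both engines produce the same value. A minimal sketch, assuming the String.hashCode-style algorithm above reproduces Hive's behavior (hive_hash and the column name 'a' are hypothetical):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def hive_hash(s):
    # same String.hashCode-style algorithm as the sketch above
    if s is None:
        return None
    h = 0
    for b in s.encode('utf-8'):
        h = (31 * h + (b if b < 128 else b - 256)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

hive_hash_udf = udf(hive_hash, IntegerType())
# usage: df.select(hive_hash_udf('a').alias('hashed'))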

Alessandro