
I have two jobs that do exactly the same thing; one is in Hive and the other in Spark. The only difference in the results is a column containing a hashed string: the results differ between Hive and Spark when calling hash().

I do understand that different libraries are used, but I was wondering: how (if possible) could Spark be configured to produce the same results as Hive?

Is it possible to figure out the hashing function (e.g. Murmur3) and use it in both engines?

Perhaps it's possible to create a Spark UDF that produces the same result as the Hive hash() function?

Lou_Ds

2 Answers


I have the same problem. What I could find is that hash() in Hive uses a Java function:

Reproduce hive hash function in Python
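
For reference, here is a small Python sketch of that algorithm. It assumes (based on the linked question) that Hive's hash() for strings behaves like Java's String.hashCode computed over the UTF-8 bytes; the function name hive_hash is hypothetical:

def hive_hash(s):
    # Java-style string hash: h = 31*h + b over the UTF-8 bytes,
    # with bytes treated as signed and the result kept in 32 bits
    h = 0
    for b in s.encode('utf-8'):
        b_signed = b if b < 128 else b - 256
        h = (31 * h + b_signed) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit integer, as Java would return it
    return h - 0x100000000 if h >= 0x80000000 else h

For ASCII input this matches Java's String.hashCode, e.g. hive_hash('ABC') == 64578.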

On the other hand, this is the implementation of the hash function in Spark (PySpark):

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def hash(*cols):
    """Calculates the hash code of given columns, and returns the result as an int column.

    >>> spark.createDataFrame([('ABC',)], ['a']).select(hash('a').alias('hash')).collect()
    [Row(hash=-757602832)]
    """
    # Delegate to the JVM-side org.apache.spark.sql.functions.hash
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.hash(_to_seq(sc, cols, _to_java_column))
    return Column(jc)

However, the Spark implementation also ends up relying on JVM hash code machinery. The problem with hashCode in general is that it is not guaranteed to be deterministic: it can depend on the JVM and the system where it runs. For this reason, even if each implementation is correct on its own, the same string hashed in Hive and in Spark can give different results.

Java, Object.hashCode() result constant across all JVMs/Systems?
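
To directly address the original question: one could wrap such a function in a Spark UDF so that both engines produce the same value. A minimal sketch, assuming the String.hashCode-style algorithm above reproduces Hive's behavior (hive_hash and the column name 'a' are hypothetical):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def hive_hash(s):
    # same String.hashCode-style algorithm as the sketch above
    if s is None:
        return None
    h = 0
    for b in s.encode('utf-8'):
        h = (31 * h + (b if b < 128 else b - 256)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

hive_hash_udf = udf(hive_hash, IntegerType())
# usage: df.select(hive_hash_udf('a').alias('hashed'))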

Alessandro