
I'm trying to add a column to a DataFrame which will contain the hash of another column.

I've found this piece of documentation: https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:

import org.apache.spark.sql.functions._
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))

But which hash function does that hash() use? Is it MurmurHash, SHA, MD5, or something else?

The value I get in this column is an integer, so the range of values is presumably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?

Viacheslav Shalamov
  • One of the wonders of _open source_ is that you can look at the [**source**](https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2124). As you can see, they use `Murmur3`. There is also another function, [`sha2`](https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.functions$@sha2(e:org.apache.spark.sql.Column,numBits:Int):org.apache.spark.sql.Column). – Luis Miguel Mejía Suárez Dec 05 '18 at 14:44
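
A hedged sketch of that `sha2` suggestion, which also covers the "string hash" part of the question (the column name `my_column` and the DataFrame `df` are taken from the question; the cast is an assumption, since `sha2` expects a string or binary column):

import org.apache.spark.sql.functions.{col, sha2}

// sha2 returns the hex-encoded digest as a StringType column;
// numBits must be one of 224, 256, 384, or 512
val withShaColumn = df.withColumn("hashed", sha2(col("my_column").cast("string"), 256))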

2 Answers


It is Murmur3, based on the source code:

  /**
   * Calculates the hash code of given columns, and returns the result as an int column.
   *
   * @group misc_funcs
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def hash(cols: Column*): Column = withExpr {
    new Murmur3Hash(cols.map(_.expr))
  }
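
A quick sketch to confirm this from the DataFrame side (assuming a DataFrame `df` with a column `my_column`): `hash` yields an `IntegerType` column, matching Murmur3's 32-bit output.

import org.apache.spark.sql.functions.{col, hash}

// hash() evaluates Murmur3 over one or more columns and returns a 32-bit integer
val hashed = df.withColumn("hashed", hash(col("my_column")))
hashed.printSchema() // the "hashed" column is reported as integer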
botchniaque
Fermat's Little Student

If you want a Long hash, Spark 3 has the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.
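
A minimal sketch of its use (Spark 3+; the column name `value` is an assumption, matching the snippet below):

import org.apache.spark.sql.functions.{col, xxhash64}

// xxhash64 applies the 64-bit xxHash algorithm and returns a LongType column
val withLongHash = df.withColumn("hashed64", xxhash64(col("value")))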

If you want only non-negative numbers, you can cast the hash to a long and add Int.MaxValue (strictly, an input whose hash is Int.MinValue still maps to -1):

import org.apache.spark.sql.types.LongType

df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue).show()
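
Alternatively, if a bounded non-negative value is acceptable, here is a sketch using the built-in pmod, which always returns a non-negative remainder (the bucket count of 1000 is an arbitrary illustration):

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// pmod yields a non-negative remainder even when the hash is negative
val bucketed = df.withColumn("bucket", pmod(hash(col("value")), lit(1000)))
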
Galuoises
  • Hi, if I only want positive numbers, how can I achieve this in Python? – wawawa Aug 22 '21 at 11:32
  • @Galuoises, can you provide some more resources on how these can be used in a Spark context, e.g. for handling data skew? – Shasu Oct 01 '22 at 19:53
  • @Shasu, sorry, but what you are asking is not related to the question on this page. Please open a new Stack Overflow question. – Galuoises Oct 04 '22 at 13:24