
So, I have this implementation of separate chaining hashing in Java: https://github.com/Big-data-analytics-project/Static-hashing-closed/blob/main/Static%20hashing%20closed

The next step is implementing it using Spark. I tried reading tutorials, but I'm still lost. How can I do this?

Nawel

2 Answers


One possibility is to create a jar from your hashing implementation and register it inside the Spark application as a UDF, e.g. from PySpark like this:

spark.udf.registerJavaFunction("udf_hash", "<fully qualified class name of the UDF inside the jar>", <return type, e.g. StringType()>)

After this, you can use it via a SQL expression, like this:

from pyspark.sql.functions import expr
df = df.withColumn("hashed_column", expr("udf_hash({})".format("column")))

Useful links:

Register UDF to SqlContext from Scala to use in PySpark

Spark: How to map Python with Scala or Java User Defined Functions?

Important: you have to pass your jar to spark-submit using --jars.
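
For illustration, here is a minimal sketch of the kind of class such a jar could contain. The package/class name com.example.ChainHashUDF and the bucket logic are assumptions, not taken from the question's repository; a plain Java class implementing the same UDF1 interface works the same way:

    package com.example

    import org.apache.spark.sql.api.java.UDF1

    // Hypothetical wrapper: delegate to your own separate chaining hash
    // table here instead of the placeholder bucket computation below.
    class ChainHashUDF extends UDF1[String, Int] {
      override def call(key: String): Int =
        math.abs(key.hashCode) % 16  // placeholder: bucket index
    }

With that class packaged in the jar and passed via --jars, the registration above would be spark.udf.registerJavaFunction("udf_hash", "com.example.ChainHashUDF", IntegerType()).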

Adam Dukkon
  • Thank you! The purpose was to implement it using Spark; I don't know if I'm allowed to reuse my hashing implementation. – Nawel Nov 05 '20 at 10:00

You can use the UDF below to achieve this:

    import java.math.BigInteger
    import javax.xml.bind.DatatypeConverter
    import org.apache.spark.sql.functions.udf

    // 1. define hash id calculation UDF
    def calculate_hashidUDF = udf((uid: String) => {
      val md = java.security.MessageDigest.getInstance("SHA-1")
      new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
    })
    // 2. register hash id calculation UDF as spark sql function
    spark.udf.register("hashid", calculate_hashidUDF)

For the direct hash value, return md's digest in the def above instead of taking the mod; as written, the function returns values from 0 to 9999.
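
For example, a variant that returns the raw digest might look like this (a sketch only; the name sha1HexUDF is made up here):

    // variant: return the SHA-1 digest as a hex string instead of mod 10000
    def sha1HexUDF = udf((uid: String) => {
      val md = java.security.MessageDigest.getInstance("SHA-1")
      DatatypeConverter.printHexBinary(md.digest(uid.getBytes))
    })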

Once you register it as a Spark UDF, you can use hashid in spark.sql as well.
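
For example (a sketch; the DataFrame df, the column uid, and the view name users are assumptions):

    import org.apache.spark.sql.functions.col

    // DataFrame API: apply the UDF value directly
    val withHash = df.withColumn("hash_id", calculate_hashidUDF(col("uid")))

    // SQL: use the name registered with spark.udf.register
    df.createOrReplaceTempView("users")
    val viaSql = spark.sql("SELECT uid, hashid(uid) AS hash_id FROM users")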

toofrellik