
I have a data frame like this:

df = [id1, id2, name1, name2, address1, address2, DOB1, DOB2]

I would like to get the Jaro-Winkler score (in a new column) for column1 and column2 in the PySpark DataFrame. I am trying to use the jellyfish Python package.

Thanks

Ajay Kharade
Muns

1 Answer


This answer is based mainly on a similar question on Stack Overflow, which looks at how to handle null values when running a jellyfish string comparison.

You'll want to wrap the jellyfish call in a UDF so that PySpark can apply it to the rows in parallel. See the code below:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType
import jellyfish

# Define a user-defined function (UDF) that returns a double.
@udf(DoubleType())
def jaro_winkler(s1, s2):
    # Older jellyfish releases expose this function as jellyfish.jaro_winkler.
    return jellyfish.jaro_winkler_similarity(s1, s2)

# Create a new column holding the score for each row.
df = df.withColumn('new_column', jaro_winkler(col('column1'), col('column2')))

# Show the first 20 results.
df.select('new_column').show()
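
If you want to sanity-check the numbers that end up in the new column, it helps to know what the score measures. Below is a rough pure-Python sketch of the textbook Jaro-Winkler definition. It is not jellyfish's implementation: the names `jaro` and `jaro_winkler_reference`, the prefix weight of 0.1, the 4-character prefix cap, and returning 0.0 for empty strings are all my assumptions, so edge cases may differ from what jellyfish returns.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (illustrative sketch, not jellyfish's code)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0  # assumption: empty input scores 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters in order and count those out of sequence.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_reference(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted by the length of the shared prefix (capped at max_prefix)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


print(round(jaro_winkler_reference("MARTHA", "MARHTA"), 4))  # 0.9611
print(round(jaro_winkler_reference("DWAYNE", "DUANE"), 4))   # 0.84
```

Scores run from 0.0 (nothing in common) to 1.0 (identical strings), and a shared prefix pushes the score up, which is why this measure is popular for comparing names.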

For the same functionality with null values handled, I would suggest changing the function as follows:

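@udf(DoubleType())
def jaro_winkler(s1, s2):
    if s1 is None or s2 is None:
        # Return a float: an int returned from a DoubleType UDF comes back as null.
        out = 0.0
    else:
        # Older jellyfish releases expose this function as jellyfish.jaro_winkler.
        out = jellyfish.jaro_winkler_similarity(s1, s2)
    return out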
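
One design note: if you keep the null check in a plain Python function and only wrap it in `udf` at the end, you can test the logic without starting a Spark session. Here is a minimal sketch of that idea. The names `null_safe_score` and `exact_match` are hypothetical, and `exact_match` is only a stand-in scorer for the demo, not a real similarity measure.

```python
from typing import Callable, Optional


def null_safe_score(
    s1: Optional[str],
    s2: Optional[str],
    scorer: Callable[[str, str], float],
    default: float = 0.0,
) -> float:
    """Return `default` if either input is None, otherwise the scorer's result as a float."""
    if s1 is None or s2 is None:
        return float(default)
    return float(scorer(s1, s2))


def exact_match(a: str, b: str) -> float:
    """Stand-in scorer for this demo only: 1.0 for identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


print(null_safe_score("Ann", "Ann", exact_match))  # 1.0
print(null_safe_score(None, "Ann", exact_match))   # 0.0
```

In the Spark job you would pass the jellyfish function as `scorer` and wrap the call with `udf(..., DoubleType())` as above. Casting to `float` in both branches keeps the return value consistent with `DoubleType`.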
Dharman
PJ Gibson