This answer is based mainly on a similar question on StackOverflow here. That example looks at how to handle null values when running a jellyfish string comparison.
You'll want to wrap the comparison in a UDF so that it runs in parallel across the PySpark executors. See the code below:
from pyspark.sql.functions import udf
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
import jellyfish
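# Define the user-defined function (UDF); it returns a double.
# Note: newer jellyfish releases expose this function as jaro_winkler_similarity.
@udf(DoubleType())
def jaro_winkler(s1, s2):
    return jellyfish.jaro_winkler(s1, s2)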
# to create a new column
df = df.withColumn('new_column', jaro_winkler(col('column1'), col('column2')))
# to show top 20 results
df.select('new_column').show()
For similar functionality that also handles null values, I would suggest changing the function as follows:
@udf(DoubleType())
def jaro_winkler(s1, s2):
    if s1 is None or s2 is None:
        # Return a float: an int here would not match DoubleType and Spark would yield null.
        out = 0.0
    else:
        out = jellyfish.jaro_winkler(s1, s2)
    return out
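
Since the null check is plain Python, it can be tried out without a Spark session before it goes into a UDF. The following is only a minimal sketch of that idea: `null_safe_score`, `stand_in_similarity` and the `default` parameter are hypothetical names, and the scorer is a stand-in built on `difflib` (it is not Jaro-Winkler) so the example runs without jellyfish installed.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def stand_in_similarity(s1: str, s2: str) -> float:
    # Stand-in scorer (difflib ratio, NOT Jaro-Winkler) so the sketch runs
    # without jellyfish; swap in the jellyfish call for real use.
    return SequenceMatcher(None, s1, s2).ratio()


def null_safe_score(
    s1: Optional[str],
    s2: Optional[str],
    scorer: Callable[[str, str], float] = stand_in_similarity,
    default: float = 0.0,
) -> float:
    # Return the default when either input is missing; otherwise return the
    # score cast to float so it matches a DoubleType column.
    if s1 is None or s2 is None:
        return default
    return float(scorer(s1, s2))


print(null_safe_score("martha", None))      # 0.0
print(null_safe_score("martha", "martha"))  # 1.0
```

Keeping the default as a float (0.0 rather than 0) mirrors the return type declared for the UDF. Whether 0.0 is the right value for missing inputs depends on your data; returning None instead would leave the column null for those rows.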