I want to use some string similarity functions that are not native to pyspark such as the jaro and jaro-winkler measures on dataframes. These are readily available in python modules such as jellyfish
. I can write pyspark udf's fine for cases where there a no null
values present, i.e. comparing cat to dog. when I apply these udf's to data where null
values are present, it doesn't work. In problems such as the one I'm solving it is very common for one of the strings to be null
I need help getting my string similarity udf to work in general, to be more specific, to work in cases where one of the values are null
I wrote a udf that works when there are no null values in the input data:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish.cjellyfish
def jaro_winkler_func(df, column_left, column_right):
jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())
df = (df
.withColumn('test',
jaro_winkler_udf(df[column_left], df[column_right])))
return df
Example input and output:
+-----------+------------+
|string_left|string_right|
+-----------+------------+
| dude| dud|
| spud| dud|
+-----------+------------+
+-----------+------------+------------------+
|string_left|string_right| test|
+-----------+------------+------------------+
| dude| dud|0.9166666666666666|
| spud| dud|0.7222222222222222|
+-----------+------------+------------------+
When I run this on data that has a null value then I get the usual reams of spark errors, the most applicable one seems to be TypeError: str argument expected
. I assume this one is due to null
values in the data since it worked when there were none.
I modified the function above to to check if both values are not null and only run the function if that's the case, otherwise return 0.
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish.cjellyfish
def jaro_winkler_func(df, column_left, column_right):
jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())
df = (df
.withColumn('test',
F.when(df[column_left].isNotNull() & df[column_right].isNotNull(),
jaro_winkler_udf(df[column_left], df[column_right]))
.otherwise(0.0)))
return df
However, I still get the same errors as before.
Sample input and what I would like the output to be:
+-----------+------------+
|string_left|string_right|
+-----------+------------+
| dude| dud|
| spud| dud|
| spud| null|
| null| null|
+-----------+------------+
+-----------+------------+------------------+
|string_left|string_right| test|
+-----------+------------+------------------+
| dude| dud|0.9166666666666666|
| spud| dud|0.7222222222222222|
| spud| null|0.0 |
| null| null|0.0 |
+-----------+------------+------------------+