I am working on a task that computes similarity scores for name-related data, using Spark and the jellyfish package in Python. Below is my code, which lives inside a class:

import jellyfish
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, DataFrame
from pyspark import SparkContext

df = self.jaro_winkler_func(df, 'df1.first_name', 'df2.first_name')

def jaro_winkler_score(self, s1, s2):
    if s1 is None or s2 is None:
        out = 0
    else:
        out = jellyfish.jaro_winkler(s1, s2)

    return out

def jaro_winkler_func(self, df, column_left, column_right):
    df = df.withColumn('test', self.jaro_winkler_score(df[column_left], df[column_right]))

    return df

Below is the error I got:

out = jellyfish.jaro_winkler(s1, s2)
TypeError: str argument expected

I have seen the related posts below about the same issue, but the functions above already borrow from the answers to those posts:

Creating score column in Pyspark data frame using jellyfish package

Pyspark: How to deal with null values in python user defined functions

I am using Spark 2.3.

Any suggestions are appreciated. Thanks in advance.

MAMS
  • You are not passing a `str`. Did you define `jaro_winkler_score` as a `UDF`? The first link from the posts you've provided explains it. – vladsiv Nov 18 '21 at 11:41
  • Thanks for pointing this out. It worked after defining jaro_winkler_score as a UDF. – MAMS Nov 18 '21 at 14:05
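
For reference, here is a minimal sketch of the fix confirmed in the comments above. In the question's code, `df[column_left]` is a Spark `Column` object, not a Python string, so calling `jaro_winkler_score` directly hands `Column` objects to jellyfish and raises the `TypeError`. Wrapping the scorer in a UDF makes Spark call it once per row with the actual string values. The names `null_safe`, `_jaro_winkler` and `with_jaro_winkler` are hypothetical helpers introduced for illustration, not part of the original code, and the `getattr` lookup assumes that newer jellyfish releases expose the function as `jaro_winkler_similarity` while older ones use `jaro_winkler`.

```python
def null_safe(score_fn, default=0.0):
    """Wrap a two-string scorer so a None on either side yields `default`."""
    def scorer(s1, s2):
        if s1 is None or s2 is None:
            return default
        # Cast to float so the value matches a DoubleType return type.
        return float(score_fn(s1, s2))
    return scorer


def _jaro_winkler(s1, s2):
    # Resolved at call time on the executor. Assumption: newer jellyfish
    # releases name this jaro_winkler_similarity, older ones jaro_winkler.
    import jellyfish
    fn = getattr(jellyfish, "jaro_winkler_similarity", None) or jellyfish.jaro_winkler
    return fn(s1, s2)


def with_jaro_winkler(df, column_left, column_right, out_col="test"):
    """Add `out_col` with the row-wise Jaro-Winkler score of two string columns."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # The UDF receives plain Python values (str or None) for each row,
    # instead of the Column objects passed in the question's code.
    score_udf = F.udf(null_safe(_jaro_winkler), DoubleType())
    return df.withColumn(out_col, score_udf(F.col(column_left), F.col(column_right)))
```

With this in place, the call from the question would become something like `df = with_jaro_winkler(df, 'df1.first_name', 'df2.first_name')`. The null branch returns `0.0` rather than `0` because the UDF is declared as `DoubleType`, and a Python `int` returned from such a UDF would likely show up as null instead of zero.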
