
So I'm trying to convert my Python algorithm to Spark-friendly code, and I'm having trouble with this one:

    import recordlinkage

    indexer = recordlinkage.SortedNeighbourhoodIndex(
        left_on=column1, right_on=column2, window=41)

    pairs = indexer.index(df_1, df_2)

It compares one column against the other and generates candidate index pairs for records that are likely to be the same (record matching).

My code:

    df1 = spark.read.csv("*.csv")
    df2 = spark.read.csv("*.csv")

    func_udf = udf(indexer.index)  # ????
    df = df.withColumn('column1', func_udf(df1.column1, df2.column2))  # ???

I've been using udf for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments, one column from one dataframe and another from the other dataframe? I can't join both dataframes as they have different lengths.

1 Answer

That's not how udfs work. UserDefinedFunctions can operate only on data that comes from a single DataFrame:

  • Standard udf on data from a single row.
  • pandas_udf on data from a single partition or single group.
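
For example, a minimal sketch of the difference (assuming a Spark 3.x session; str_len and str_len_vec are hypothetical names, not from the question):

    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import LongType
    import pandas as pd

    # Standard udf: invoked once per row, receives plain Python values.
    @udf(LongType())
    def str_len(s):
        return len(s) if s is not None else None

    # pandas_udf: invoked once per batch, receives a pandas Series.
    @pandas_udf(LongType())
    def str_len_vec(s: pd.Series) -> pd.Series:
        return s.str.len()

    # Either way, the function only sees data from the one DataFrame
    # it is applied to.
    df1.select(str_len("column1"), str_len_vec("column1"))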

I can't join both dataframes as they have different lengths.

Join is exactly what you should do (standard or manual broadcast). There is no need for the objects to be of the same length; a Spark join is a relational join, not a row-wise merge.
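
For example, a minimal sketch (assuming, purely for illustration, that a match means column1 equals column2; the two frames can have any row counts):

    from pyspark.sql.functions import broadcast

    # Standard (shuffle) join on an equality condition.
    pairs = df1.join(df2, df1["column1"] == df2["column2"])

    # If one side is small enough, broadcast it to every executor instead.
    pairs = df1.join(broadcast(df2), df1["column1"] == df2["column2"])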

For similarity joins you can use the built-in approximate join tools in Spark ML (for example MinHashLSH):
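
A hedged sketch of one way to do it, using character trigrams and MinHashLSH (the common column name text, the number of hash tables, and the 0.6 distance threshold are all placeholder choices, not part of the original answer):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

    # Rename both sides to a common column so one fitted pipeline serves both.
    left = df1.withColumnRenamed("column1", "text")
    right = df2.withColumnRenamed("column2", "text")

    model = Pipeline(stages=[
        RegexTokenizer(pattern="", inputCol="text", outputCol="chars",
                       minTokenLength=1),                  # split into characters
        NGram(n=3, inputCol="chars", outputCol="ngrams"),  # character trigrams
        HashingTF(inputCol="ngrams", outputCol="vectors"), # sparse count vectors
        MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5),
    ]).fit(left)

    # Candidate pairs whose approximate Jaccard distance is below 0.6.
    # Note: MinHash needs at least one non-zero entry per row, so rows that
    # produce empty vectors (e.g. strings shorter than 3 characters) must
    # be filtered out first.
    pairs = model.stages[-1].approxSimilarityJoin(
        model.transform(left), model.transform(right), 0.6,
        distCol="jaccard_dist")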