I'm trying to perform a custom join of two DataFrames (df1 and df2) in PySpark (similar to this), using a UDF as the join condition. The code looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
The error message I'm getting is:
java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child
Is there a way to write a PySpark UDF that can process columns from two separate dataframes?
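For context, one workaround that might sidestep the error is to take the cartesian product first and then apply the UDF as a filter, so that both columns come from a single DataFrame. The following is only a rough sketch under assumptions: the body of isJoin here is a hypothetical placeholder (the real predicate is not shown above), the helper name cross_join_then_filter is made up for illustration, and crossJoin assumes Spark 2.1 or later.

```python
def isJoin(a, b):
    # Hypothetical placeholder predicate: the real isJoin is not shown in the
    # question. Here it matches when string a is a substring of string b.
    return a is not None and b is not None and a in b


def cross_join_then_filter(df1, df2):
    # Hypothetical helper. The PySpark imports live inside the function so the
    # placeholder predicate above can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())

    # After the cross join, col_a and col_b belong to the same DataFrame,
    # so the UDF only needs attributes from a single child.
    crossed = df1.crossJoin(df2)
    return crossed.filter(my_join_udf(col("col_a"), col("col_b")))
```

With the df1 and df2 above, this would be called as my_join_df = cross_join_then_filter(df1, df2). The obvious downside is that the cartesian product grows with the product of the two row counts, so this may only be practical when at least one of the DataFrames is small.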