I am not doing a LeftSemi join anywhere, nor am I using a Python UDF. Still, I get this error when joining two DataFrames.
df1 - has a single column, which is the primary key of the table, say "customerHash". It may be empty (in fact, in my current case it is empty).
df2 - a table that also has a customerHash column, but its primary key is a different column.
result = df1\
.select("customerHash")\
.distinct()\
.join(df2, ["customerHash"], 'inner')
The code runs successfully, but when I try to display/collect/persist the result table, it throws the mentioned error. I have absolutely no idea why this is happening. My guess is that it is because df1 is empty, but joins don't throw errors when tables are empty, right?
My main goal is to get only those rows of df2 whose customerHash is in df1. I could use
df2.filter(F.col("customerHash").isin(df1.select("customerHash").distinct().collect()....))
but I don't want to use it as it is very slow.
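To make the expected result concrete, here is a plain-Python sketch (not Spark) of the semantics I am after. The rows, the "orderId" column, and the helper name rows_with_matching_hash are made up for illustration, and null handling is ignored.

```python
# Plain-Python illustration (not Spark) of the result I expect from the join.
# The rows and the "orderId" column are hypothetical; nulls are ignored.

def rows_with_matching_hash(df1_rows, df2_rows):
    """Keep the df2 rows whose customerHash also appears in df1."""
    # Counterpart of df1.select("customerHash").distinct()
    wanted = {row["customerHash"] for row in df1_rows}
    # Counterpart of the inner join on customerHash: since the keys are
    # distinct, each df2 row shows up at most once in the output.
    return [row for row in df2_rows if row["customerHash"] in wanted]


df2_rows = [
    {"orderId": 1, "customerHash": "a"},
    {"orderId": 2, "customerHash": "b"},
    {"orderId": 3, "customerHash": "a"},
]

# df1 contains one key: only the matching df2 rows should remain.
non_empty = rows_with_matching_hash([{"customerHash": "a"}], df2_rows)

# df1 is empty (my current case): I expect an empty result, not an error.
empty = rows_with_matching_hash([], df2_rows)

print(non_empty)
print(empty)
```

In other words, with an empty df1 I would expect result to simply contain no rows.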
Please help!