I am trying to add a column to my Spark DataFrame that flags whether each row's value in a column appears in a separate DataFrame.
This is my main Spark DataFrame (df_main):
+--------+
|main |
+--------+
|28asA017|
|03G12331|
|1567L044|
|02TGasd8|
|1asd3436|
|A1234567|
|B1234567|
+--------+
This is my reference (df_ref); there are hundreds of rows in this reference, so I obviously can't hard code them like this solution or this one:
+--------+
|mask_vl |
+--------+
|A1234567|
|B1234567|
...
+--------+
Normally, what I'd do with a pandas DataFrame is this:
df_main['is_inref'] = np.where(df_main['main'].isin(df_ref.mask_vl.values), "YES", "NO")
So that I would get this:
+--------+--------+
|main |is_inref|
+--------+--------+
|28asA017|NO |
|03G12331|NO |
|1567L044|NO |
|02TGasd8|NO |
|1asd3436|NO |
|A1234567|YES |
|B1234567|YES |
+--------+--------+
I have tried the following code, but I don't understand what the error in the picture means.
df_main = df_main.withColumn('is_inref', "YES" if F.col('main').isin(df_ref) else "NO")
df_main.show(20, False)
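I was also wondering whether a join could do this instead. Here is a rough sketch of the left-join idea I had in mind (it assumes renaming mask_vl to main and filling non-matches with "NO" is acceptable), but I'm not sure if it's the idiomatic Spark way:
from pyspark.sql import functions as F

# Tag every reference value with "YES", left join onto df_main on the value,
# then fill the rows that found no match with "NO".
df_flagged = (
    df_main.join(
        df_ref.select(F.col('mask_vl').alias('main')).withColumn('is_inref', F.lit('YES')),
        on='main',
        how='left',
    )
    .fillna('NO', subset=['is_inref'])
)
df_flagged.show(20, False)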