I have 2 dataframes named - brand_name and poi_name.
Dataframe 1(brand_name):-
+-------------+
|brand_stop[0]|
+-------------+
|TOASTMASTERS |
|USBORNE |
|ARBONNE |
|USBORNE |
|ARBONNE |
|ACADEMY |
|ARBONNE |
|USBORNE |
|USBORNE |
|PILLAR |
+-------------+
Dataframe 2:-(poi_name)
+---------------------------------------+
|Name |
+---------------------------------------+
|TOASTMASTERS DISTRICT 48 |
|USBORNE BOOKS AND MORE |
|ARBONNE |
|USBORNE BOOKS AT HOME |
|ARBONNE |
|ACADEMY, LTD. |
|ARBONNE |
|USBORNE BOOKS AT HOME |
|USBORNE BOOKS & MORE |
|PILLAR TO POST HOME INSPECTION SERVICES|
+---------------------------------------+
I want to check whether the strings in brand_stop column of dataframe 1 are present in Name column of dataframe 2. The matching should be done row wise and then if there is a successful match, that particular record should be stored in a new column.
I have tried filtering the dataframe using Join:-
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
contains = udf(lambda s, q: q in s, BooleanType())
like_with_python_udf = (poi_names.join(brand_names1)
.where(contains(col("Name"), col("brand_stop[0]")))
.select(col("Name")))
like_with_python_udf.show()
But this shows an error
"AnalysisException: u'Detected cartesian product for INNER join between logical plans"
I am new to PySpark. Please help me with this.
Thank you