PySpark merge based on whether a value is in a list

Asked Apr 26 '18 at 08:10

Active Apr 26 '18 at 12:33

I have two Spark DataFrames, A and B (I'm using Python). A has a string column, "Name", and B has a column holding a list of strings, "NamesList". I would like to merge (join) A and B on the condition that A.Name is contained in B.NamesList.

So to give you an example, A could be

+---+------+
| Id|  Name|
+---+------+
|  1|George|
|  2| Sarah|
+---+------+

B could be

+---+--------------------+
|Id2|           NamesList|
+---+--------------------+
|  6| [Bob, Alice, Sarah]|
|  7|[Thomas, Bob, Alice]|
+---+--------------------+

And I would like the result to be

+---+---+-----+-------------------+
| Id|Id2| Name|          NamesList|
+---+---+-----+-------------------+
|  2|  6|Sarah|[Bob, Alice, Sarah]|
+---+---+-----+-------------------+

Any ideas how to do this in an efficient way?
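
To pin down the intended semantics, here is a minimal plain-Python sketch of the join using the sample rows above. The function name `join_on_membership` and the dict-based row layout are made up for illustration and are not part of any Spark API. The idea is to "explode" each NamesList into individual names and then match on equality, rather than testing every (A, B) pair. The trailing comments show how the same two shapes might be written with the PySpark functions `array_contains` and `explode`; those lines are an untested assumption about how this maps onto the DataFrames in the question.

```python
from collections import defaultdict

# Sample data from the question, as plain Python rows.
A = [{"Id": 1, "Name": "George"}, {"Id": 2, "Name": "Sarah"}]
B = [
    {"Id2": 6, "NamesList": ["Bob", "Alice", "Sarah"]},
    {"Id2": 7, "NamesList": ["Thomas", "Bob", "Alice"]},
]


def join_on_membership(a_rows, b_rows):
    """Join rows of A to rows of B where A.Name appears in B.NamesList."""
    # "Explode" B: map every distinct name to the B rows that list it.
    index = defaultdict(list)
    for b in b_rows:
        for name in set(b["NamesList"]):  # set() avoids duplicate matches
            index[name].append(b)
    # One equality lookup per A row instead of comparing every (A, B) pair.
    return [
        {"Id": a["Id"], "Id2": b["Id2"], "Name": a["Name"], "NamesList": b["NamesList"]}
        for a in a_rows
        for b in index.get(a["Name"], [])
    ]


result = join_on_membership(A, B)
print(result)

# Rough PySpark equivalents (untested here; A and B are the DataFrames above):
#   from pyspark.sql import functions as F
#   A.join(B, F.expr("array_contains(NamesList, Name)"))             # direct condition
#   A.join(B.withColumn("Name", F.explode("NamesList")), on="Name")  # explode, then equi-join
```

With the sample data this yields the single Sarah row (Id 2, Id2 6). As far as I know, the `array_contains` condition is a non-equality join, which Spark tends to plan as a nested-loop style join, while the exploded variant is an ordinary equi-join on Name. The exploded variant would also produce repeated rows if a NamesList contains the same name twice, so deduplication may be needed there.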

edited Apr 26 '18 at 12:33

zero323

asked Apr 26 '18 at 08:10

bettaberg

How big are your dataframes? – Steven Apr 26 '18 at 09:29
They have around 600K entries. – bettaberg Apr 26 '18 at 11:05
