I have 2 dataframes.
One dataframe DF1 has a column with values that are seperated by a delimiter say comma. The second dataframe DF2 has a column with single value (this could be part of the comma seperated column values in the other dataframe, DF1). I need to iterate DF2 records/rows, and see if DF2.color exists in the comma seperated column values in DF1.csv_column, and if exists add the df1 row ID TO A NEW DATAFRAME.
df1= sqlContext.createDataFrame([("A001","RED, WHITE, BLUE"),("A002","RED, YELLOW"),("A003","GREEN,RED"),("A004","WHITE,YELLOW")], ["id","csv_column"])
df1.show()
df2= sqlContext.createDataFrame([("C1","RED"),("C2","WHITE"),("C3","BLUE"),("C4","YELLOW"),("C5","RED"),("C6","GREEN"),("C7","BLUE")], ["CLRCODE","COLOR"])
df2.show()
+----+----------------+
| id | csv_column |
+----+----------------+
|A001|RED, WHITE, BLUE|
|A002|RED, YELLOW |
|A003|GREEN, RED |
|A004|WHITE, YELLOW |
+----+----------------+
+-------+-------+
|CLRCODE| COLOR |
+-------+-------+
| C1 | RED |
| C2 | WHITE |
| C3 | BLUE |
| C4 | YELLOW|
| C5 | RED |
| C6 | GREEN |
| C7 | BLUE |
+-------+-------+
Expected Result: The column csv_column in df1 has RED,WHITE,BLUE and so I have add the IDs for RED,WHITE,BLUE as a row to the new dataframe, and so on. Please note CLRCODE in DF2, is just a place holder and is not used hence. How to get this result.
+-------+
|df1.id |
+-------+
|A001 |
|A002 |
|A003 |
|A001 |
|A004 |
|A001 |
|A002 |
|A004 |
|A001 |
|A002 |
|A003 |
|A003 |
|A001 |
+-------+
I checked in this SO solution here, but there the dataframe comma seperated column is validated against a static string value, but I need to iterate through a dataframe which contains many rows of various values.
Thanks for the help.