I have a PySpark RDD (`myRDD`) that is a list of variable-length ID lists, such as:

    [['a', 'b', 'c'], ['d', 'f'], ['g', 'h', 'i', 'j']]
I have a PySpark DataFrame (`myDF`) with columns `ID` and `value`.
I want to query `myDF` with:

    outputDF = myDF.where(col("ID").isin(id_list)).select(F.collect_set("value").alias("my_values"))

(the filter has to come before the aggregating `select`, since `select(F.collect_set("value"))` drops the `ID` column), where `id_list` is an element of `myRDD`, such as `['d', 'f']` or `['a', 'b', 'c']`.
An example would be:

    outputDF = myDF.where(col("ID").isin(['d', 'f'])).select(F.collect_set("value").alias("my_values"))
What is a parallelizable way to use the RDD to query the DF like this?
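For concreteness, here is the plain-Python equivalent of what I want each element of the RDD to produce (hypothetical sample data, no Spark; `rows` stands in for `myDF` and `collect_values` mimics `collect_set("value")` after the `isin` filter):

```python
# Hypothetical sample data: myDF represented as (ID, value) rows.
rows = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('f', 5), ('g', 6)]

# The RDD: a list of variable-length ID lists.
id_lists = [['a', 'b', 'c'], ['d', 'f'], ['g', 'h', 'i', 'j']]

def collect_values(rows, id_list):
    """Set of values whose ID is in id_list -- the per-list result I'm after."""
    return {value for id_, value in rows if id_ in id_list}

# One result set per element of the RDD.
results = [collect_values(rows, ids) for ids in id_lists]
# e.g. results[1] == {4, 5} for id_list ['d', 'f']
```

I want the Spark equivalent of this loop, but executed in parallel rather than by collecting the RDD and running one query per element on the driver.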