I want to split an RDD into multiple RDDs based on a value in each row. The values that appear in the rows are known in advance and fixed in nature.
For example:
source_rdd = sc.parallelize([('a',1),('a',2),('a',3),('b',4),('b',5),('b',6)])
should be split into two RDDs, one containing only the rows with key 'a' and the other containing only the rows with key 'b'.
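For clarity, the expected result is conceptually equivalent to building the two RDDs by hand (shown here only to illustrate the desired contents):

a_rdd = sc.parallelize([('a', 1), ('a', 2), ('a', 3)])
b_rdd = sc.parallelize([('b', 4), ('b', 5), ('b', 6)])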
- I have tried the groupByKey method and was able to do the split successfully after calling collect() on the grouped RDD, which I cannot do in production due to memory constraints:
a_rdd, b_rdd = source_rdd.keyBy(lambda row: row[0]).groupByKey().collect()
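# Note: groupByKey().collect() returns a list of (key, ResultIterable) pairs on the driver,
# so a_rdd and b_rdd above are plain tuples held in driver memory, not distributed RDDs.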
- The current implementation is to apply a separate filter operation to get each RDD:
a_rdd = source_rdd.filter(lambda row: row[0] == 'a')
b_rdd = source_rdd.filter(lambda row: row[0] == 'b')
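Since the keys are known up front, I assume the same filter pattern generalizes to any number of keys, roughly like this (known_keys and the default-argument trick are my own sketch, not production code):

known_keys = ['a', 'b']
# one filter per pre-known key; k=k pins the key value inside each lambda
rdd_by_key = {k: source_rdd.filter(lambda row, k=k: row[0] == k) for k in known_keys}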
Can this be optimized further? What would be the best way to do it in production with data that cannot fit in memory?
Usage: these RDDs will be converted into separate DataFrames (one per key), each with a different schema, and stored in S3 as output.
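Roughly, the intended downstream usage looks like this (column names, schemas and the S3 paths below are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# each RDD gets its own DataFrame with its own (placeholder) schema
a_df = spark.createDataFrame(a_rdd, ['key', 'value'])
b_df = spark.createDataFrame(b_rdd, ['key', 'amount'])

# each DataFrame is written out separately (bucket and prefix are placeholders)
a_df.write.mode('overwrite').parquet('s3a://my-bucket/output/a/')
b_df.write.mode('overwrite').parquet('s3a://my-bucket/output/b/')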
Note: I would prefer a PySpark implementation. I have read a lot of Stack Overflow answers and blog posts, but have not yet found an approach that works for me.
I have already seen the question this one was marked as a duplicate of, which I have already mentioned in my question. I am asking again because the solution provided there does not seem to be the most optimized approach and is three years old.