
I have an RDD containing key-value pairs. There are just three keys, and I would like to write all the elements for a given key to a text file. Currently I am doing this in three passes, but I wanted to see if I could do it in one pass.

Here is what I have so far:

# I have an rdd (called my_rdd) such that a record is a key value pair, e.g.: 
# ('data_set_1','value1,value2,value3,...,value100')

my_rdd.cache()
my_keys = ['data_set_1','data_set_2','data_set_3']
for key in my_keys:
    my_rdd.filter(lambda l: l[0] == key).map(lambda l: l[1]).saveAsTextFile(my_path+'/'+key)

This works; however, caching the RDD and iterating over it three times can be a lengthy process. I am wondering if there is any way to write all three files simultaneously?

mgoldwasser
  • There's an issue for this: https://issues.apache.org/jira/browse/SPARK-3533. Workaround posted here: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job (see the sketch after these comments). – Def_Os Jan 19 '16 at 00:19
  • @mgoldwasser It's a good case study with RDDs; however, this can easily be done with a DataFrame using the DataFrameWriter's partitionBy. Cheers – vikrant rana May 16 '19 at 14:08
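To make the linked workaround concrete, here is a rough Scala sketch of the single-pass approach. It is only a sketch: myRdd and myPath are hypothetical stand-ins for the question's RDD and output path, and it assumes both keys and values are strings. Subclassing MultipleTextOutputFormat routes each record to a file named after its key in one saveAsHadoopFile call:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Write only the value into each output record, dropping the key.
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()

  // Name each output file after the record's key.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

// myRdd and myPath are hypothetical names; group each key into a single
// partition first so every output file is written by exactly one task.
myRdd.partitionBy(new HashPartitioner(3))
  .saveAsHadoopFile(myPath, classOf[String], classOf[String],
    classOf[RDDMultipleTextOutputFormat])

On newer Spark versions, the DataFrame route vikrant rana mentions is shorter still, roughly my_rdd.toDF("key", "value").write.partitionBy("key").text(myPath).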

1 Answer


An alternative approach is to use a custom Partitioner, which partitions your dataset before writing to the output files (compared to the approach provided by Def_Os).

For example:
RDD[(K, V)].partitionBy(partitioner: Partitioner)

import org.apache.spark.Partitioner

class CustomizedPartitioner extends Partitioner {

  // One partition per known key, plus a catch-all for anything else.
  override def numPartitions: Int = 4

  // Map each of the three known keys to its own partition.
  override def getPartition(key: Any): Int = {
    key match {
      case "data_set_1" => 0
      case "data_set_2" => 1
      case "data_set_3" => 2
      case _ => 3
    }
  }
}
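
For completeness, a minimal usage sketch (myRdd and myPath are hypothetical names for the question's RDD and output path): partition once with the custom partitioner, drop the keys, and save in a single pass.

val partitioned = myRdd.partitionBy(new CustomizedPartitioner)
// part-00000..part-00002 hold data_set_1..data_set_3; part-00003 holds anything else.
partitioned.values.saveAsTextFile(myPath)

Note that this writes one output directory with a part file per key rather than three separate directories, so you may still need to move or rename the files afterwards.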
Shawn Guo