I have reached a point where I have the result of a groupByKey operation. Now I want to write each (key, value) pair into a different file, using the key as the file name and the values as its content.
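For concreteness, the grouped RDD looks roughly like this (the keys and values here are made up):

```python
# Made-up example of the grouped RDD's shape: key -> iterable of values.
pairs = sc.parallelize([("apple", 1), ("banana", 2), ("apple", 3)])
grouped = pairs.groupByKey()
# grouped is roughly [("apple", <1, 3>), ("banana", <2>)];
# the goal is one file per key, named after the key, holding its values.
```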
First, I tried to collect the results in the driver program so that I could use open() and write local files, but it failed with a buffer overflow because the results are very large.
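What I tried was essentially this (simplified; the /tmp/out path is a placeholder):

```python
# First attempt: pull everything to the driver and write local files.
# collect() materializes the whole result in driver memory, which is
# where it fell over.
for key, values in grouped.collect():
    with open("/tmp/out/%s" % key, "w") as f:
        for v in values:
            f.write("%s\n" % v)
```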
I then tried filtering the (key, value) pairs into a new RDD per key so that I could call saveAsTextFile on each one, but that turned out to be far too slow because of all the extra network communication.
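The second attempt looked roughly like this (the hdfs:///out path is a placeholder); running one full Spark job per key is what makes it so slow:

```python
# Second attempt: one filter + saveAsTextFile job per key, which means
# a full pass over the data for every key.
for k in grouped.keys().collect():
    grouped.filter(lambda kv, k=k: kv[0] == k) \
           .flatMap(lambda kv: kv[1]) \
           .saveAsTextFile("hdfs:///out/%s" % k)
```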
Now I think it might be faster to use a foreach operation over the grouped RDD and write to HDFS files directly in each iteration, but I have no idea which function in the Python API to use for that. Could anyone show me some examples, or suggest another way to accomplish this in PySpark?
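Roughly what I have in mind is the sketch below. Since I could not find anything built into PySpark, it relies on the third-party hdfs package (a WebHDFS client, pip install hdfs), and I am not sure that is the right tool; the namenode URL, user, and output path are all placeholders:

```python
# Third idea: write straight to HDFS from the workers.
# ASSUMPTION: uses the third-party `hdfs` package (pip install hdfs),
# which talks to WebHDFS; "http://namenode:50070" and "hadoop" are made up.
def write_partition(pairs):
    from hdfs import InsecureClient  # imported on the worker
    client = InsecureClient("http://namenode:50070", user="hadoop")
    for key, values in pairs:
        content = "\n".join(str(v) for v in values)
        # groupByKey leaves each key in exactly one partition,
        # so no two workers should write the same file
        client.write("/out/%s" % key, data=content,
                     encoding="utf-8", overwrite=True)

grouped.foreachPartition(write_partition)
```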
There is a similar question here on SO; however, that solution is in Scala, while I want to solve it in Python, because I found no way to communicate directly with HDFS from PySpark.
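For completeness, the only other Python-side route I can think of is the DataFrame writer, though I am not sure it fits my case: it assumes a newer Spark version with DataFrame support, and it produces one directory per key (out/k=&lt;key&gt;/part-*) rather than one file named after the key:

```python
# Alternative: split output by key via the DataFrame writer.
# Yields one directory per key instead of one file named after it.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # enables rdd.toDF()
df = grouped.flatMap(lambda kv: [(kv[0], str(v)) for v in kv[1]]) \
            .toDF(["k", "v"])
df.write.partitionBy("k").mode("overwrite").text("hdfs:///out")
```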