I have reached a point where I have the result of a groupByKey operation. Now I want to write each (key, value) pair into a different file, using the key as the file name and the values as its content.
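For concreteness, the grouped RDD looks roughly like this (the keys and values here are made up):

```python
# Made-up example of the grouped RDD's shape: key -> iterable of values.
pairs = sc.parallelize([("apple", 1), ("banana", 2), ("apple", 3)])
grouped = pairs.groupByKey()
# grouped is roughly [("apple", <1, 3>), ("banana", <2>)];
# the goal is one file per key, named after the key, holding its values.
```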
First, I tried to collect the results in the driver program so that I could use open() and write local files, but it failed with a buffer overflow because the results are very large.
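What I tried was essentially this (simplified; the /tmp/out path is a placeholder):

```python
# First attempt: pull everything to the driver and write local files.
# collect() materializes the whole result in driver memory, which is
# where it fell over.
for key, values in grouped.collect():
    with open("/tmp/out/%s" % key, "w") as f:
        for v in values:
            f.write("%s\n" % v)
```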
I then tried filtering the (key, value) pairs into a new RDD per key so that I could call saveAsTextFile on each one, but that turned out to be far too slow because of all the extra network communication.
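The second attempt looked roughly like this (the hdfs:///out path is a placeholder); running one full Spark job per key is what makes it so slow:

```python
# Second attempt: one filter + saveAsTextFile job per key, which means
# a full pass over the data for every key.
for k in grouped.keys().collect():
    grouped.filter(lambda kv, k=k: kv[0] == k) \
           .flatMap(lambda kv: kv[1]) \
           .saveAsTextFile("hdfs:///out/%s" % k)
```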
Now I think it might be faster to use a foreach operation over the grouped RDD and write to HDFS files directly in each iteration, but I have no idea which function in the Python API to use for that. Could anyone show me some examples, or suggest another way to accomplish this in PySpark?
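Roughly what I have in mind is the sketch below. Since I could not find anything built into PySpark, it relies on the third-party hdfs package (a WebHDFS client, pip install hdfs), and I am not sure that is the right tool; the namenode URL, user, and output path are all placeholders:

```python
# Third idea: write straight to HDFS from the workers.
# ASSUMPTION: uses the third-party `hdfs` package (pip install hdfs),
# which talks to WebHDFS; "http://namenode:50070" and "hadoop" are made up.
def write_partition(pairs):
    from hdfs import InsecureClient  # imported on the worker
    client = InsecureClient("http://namenode:50070", user="hadoop")
    for key, values in pairs:
        content = "\n".join(str(v) for v in values)
        # groupByKey leaves each key in exactly one partition,
        # so no two workers should write the same file
        client.write("/out/%s" % key, data=content,
                     encoding="utf-8", overwrite=True)

grouped.foreachPartition(write_partition)
```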
There is a similar question here on SO; however, that solution is in Scala, while I want to solve it in Python, because I found no way to communicate directly with HDFS from PySpark.
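For completeness, the only other Python-side route I can think of is the DataFrame writer, though I am not sure it fits my case: it assumes a newer Spark version with DataFrame support, and it produces one directory per key (out/k=&lt;key&gt;/part-*) rather than one file named after the key:

```python
# Alternative: split output by key via the DataFrame writer.
# Yields one directory per key instead of one file named after it.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # enables rdd.toDF()
df = grouped.flatMap(lambda kv: [(kv[0], str(v)) for v in kv[1]]) \
            .toDF(["k", "v"])
df.write.partitionBy("k").mode("overwrite").text("hdfs:///out")
```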