
I want to export data to separate text files, one per key; I can do it with this hack:

# Collect the distinct keys to the driver, then run one filtering query
# and one save per key (N + 1 Spark jobs in total).
for fips in sqlContext.sql("SELECT DISTINCT FIPS FROM MY_DF").map(lambda r: r.FIPS).collect():
    sqlContext.sql("SELECT * FROM MY_DF WHERE FIPS = '%s'" % fips).rdd.saveAsTextFile('county_{}'.format(fips))

What is the right way to do it with Spark 1.3.1/Python DataFrames? I want to do it in a single job, as opposed to N (or N + 1) jobs.

Maybe something like:

saveAsTextFileByKey()

bcollins
  • There is a way to do this in PySpark 1.4+: http://stackoverflow.com/a/37150604/877069 – Nick Chammas May 19 '16 at 13:48
  • Possible duplicate of [Write to multiple outputs by key Spark - one Spark job](http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job) – Nick Chammas May 19 '16 at 13:48
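
For reference, a minimal sketch of the Spark 1.4+ approach linked in the first comment above (not available in 1.3.1), assuming the registered table MY_DF has a FIPS column:

# Spark 1.4+ only, per the answer linked in the comment above.
# partitionBy writes one subdirectory per distinct key in a single job,
# e.g. counties/FIPS=06037/part-..., rather than separate county_X files.
df = sqlContext.table("MY_DF")
df.write.partitionBy("FIPS").json("counties")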

1 Answer


Spark in general does not have RDD operations with multiple outputs, but for writing files there is a nice trick: [Write to multiple outputs by key Spark - one Spark job](http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job).
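
The trick in that answer subclasses Hadoop's MultipleTextOutputFormat in Scala so that the output file name is derived from each record's key, letting a single saveAsHadoopFile call write one file per key in one job. From PySpark this can only work if such a class is already compiled and on the JVM classpath; below is a rough, untested sketch of the Python side, with com.example.KeyBasedOutput standing in as a hypothetical name for a class built from the linked answer:

# Rough sketch, not a verified recipe. Assumes com.example.KeyBasedOutput is a
# compiled MultipleTextOutputFormat subclass (as in the linked answer) that names
# each output file after the record's key, and that it is on the executor classpath.
pairs = sqlContext.table("MY_DF").rdd.map(lambda r: (str(r.FIPS), ','.join(str(c) for c in r)))
pairs.saveAsHadoopFile(
    'counties',
    outputFormatClass='com.example.KeyBasedOutput',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.Text')

If that works, the whole dataset is written in a single job, with one output file per distinct FIPS value.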

Daniel Darabos
  • Sorry, I don't know if that's possible to do from PySpark. I have no experience with the Python interface. – Daniel Darabos Jun 08 '15 at 14:28
  • Hey, yeah. I saw this post, but it was unclear how to implement it on the Python side. – bcollins Jun 08 '15 at 15:28
  • It might not be possible. While PySpark covers most of the Spark API, you need access to the Hadoop file API too to make this work. Let's hope your bounty attracts someone who actually knows the Python API! – Daniel Darabos Jun 09 '15 at 07:15