
I want to export data to separate text files, one per key; I can do it with this hack:

# Collect the distinct keys to the driver, then run one filtering query
# and one save per key (N + 1 Spark jobs in total).
for fips in sqlContext.sql("SELECT DISTINCT FIPS FROM MY_DF").map(lambda r: r.FIPS).collect():
    sqlContext.sql("SELECT * FROM MY_DF WHERE FIPS = '%s'" % fips).rdd.saveAsTextFile('county_{}'.format(fips))

What is the right way to do it with Spark 1.3.1/Python DataFrames? I want to do it in a single job, as opposed to N (or N + 1) jobs.

Maybe something like:

saveAsTextFileByKey()

bcollins
  • There is a way to do this in PySpark 1.4+: http://stackoverflow.com/a/37150604/877069 – Nick Chammas May 19 '16 at 13:48
  • Possible duplicate of [Write to multiple outputs by key Spark - one Spark job](http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job) – Nick Chammas May 19 '16 at 13:48
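
For reference, a minimal sketch of the Spark 1.4+ approach linked in the first comment above (not available in 1.3.1), assuming the registered table MY_DF has a FIPS column:

# Spark 1.4+ only, per the answer linked in the comment above.
# partitionBy writes one subdirectory per distinct key in a single job,
# e.g. counties/FIPS=06037/part-..., rather than separate county_X files.
df = sqlContext.table("MY_DF")
df.write.partitionBy("FIPS").json("counties")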

1 Answer


Spark in general does not have RDD operations with multiple outputs, but for writing files there is a nice trick: [Write to multiple outputs by key Spark - one Spark job](http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job).
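
The trick in that answer subclasses Hadoop's MultipleTextOutputFormat in Scala so that the output file name is derived from each record's key, letting a single saveAsHadoopFile call write one file per key in one job. From PySpark this can only work if such a class is already compiled and on the JVM classpath; below is a rough, untested sketch of the Python side, with com.example.KeyBasedOutput standing in as a hypothetical name for a class built from the linked answer:

# Rough sketch, not a verified recipe. Assumes com.example.KeyBasedOutput is a
# compiled MultipleTextOutputFormat subclass (as in the linked answer) that names
# each output file after the record's key, and that it is on the executor classpath.
pairs = sqlContext.table("MY_DF").rdd.map(lambda r: (str(r.FIPS), ','.join(str(c) for c in r)))
pairs.saveAsHadoopFile(
    'counties',
    outputFormatClass='com.example.KeyBasedOutput',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.Text')

If that works, the whole dataset is written in a single job, with one output file per distinct FIPS value.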

Daniel Darabos
  • Sorry, I don't know if that's possible to do from PySpark. I have no experience with the Python interface. – Daniel Darabos Jun 08 '15 at 14:28
  • Hey, yeah. I saw this post, but it was unclear how to implement it on the Python side. – bcollins Jun 08 '15 at 15:28
  • It might not be possible. While PySpark covers most of the Spark API, you need access to the Hadoop file API too to make this work. Let's hope your bounty attracts someone who actually knows the Python API! – Daniel Darabos Jun 09 '15 at 07:15