
I have a DataFrame and I want to save it as a single file on an HDFS location.

I found a solution here: Write single CSV file using spark-csv

df.coalesce(1)
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .save("mydata.csv")

But all the data is written to mydata.csv/part-00000, and I want it to be a single mydata.csv file.

Is that possible?

Any help is appreciated.

shikha dubey
  • The only way, AFAIK, is to repartition to 1 partition before you do this (see the sketch after these comments). – elmalto Nov 24 '16 at 18:42
  • It's not possible! Please check the answer at [this link](http://stackoverflow.com/questions/40577546/how-to-save-rdd-data-into-json-files-not-folders/40577736#40577736) – mrsrinivas Nov 24 '16 at 19:01
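The first comment refers to the repartition approach; here is a minimal sketch, assuming the `df` from the question. `repartition(1)` forces a full shuffle into a single partition, while `coalesce(1)` merges existing partitions without a shuffle; either way, Spark still writes a directory containing a single part file, which is exactly the limitation asked about:

df.repartition(1) \
    .write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("mydata.csv")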

1 Answer


It's not possible with the standard Spark library, but you can use the Hadoop FileSystem API to manage the filesystem: save the output to a temporary directory, then move the part file to the requested path. For example (in pyspark):

# Write the single partition to a temporary directory first
df.coalesce(1) \
    .write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("mydata.csv-temp")

from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

# Get a handle on the Hadoop filesystem, find the single part file,
# move it to the requested path and clean up the temporary directory
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
file = fs.globStatus(spark._jvm.Path('mydata.csv-temp/part*'))[0].getPath().getName()
fs.rename(spark._jvm.Path('mydata.csv-temp/' + file), spark._jvm.Path('mydata.csv'))
fs.delete(spark._jvm.Path('mydata.csv-temp'), True)
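If you need this in more than one place, the same steps can be wrapped in a small helper. This is only a sketch: the function name `save_single_csv` is made up for illustration, and it assumes a SparkSession named `spark`:

from py4j.java_gateway import java_import

def save_single_csv(df, spark, path):
    # Hypothetical helper: write `df` as one CSV file at `path`
    tmp = path + '-temp'
    df.coalesce(1) \
        .write.format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .save(tmp)
    java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
    fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    part = fs.globStatus(spark._jvm.Path(tmp + '/part*'))[0].getPath().getName()
    fs.rename(spark._jvm.Path(tmp + '/' + part), spark._jvm.Path(path))
    fs.delete(spark._jvm.Path(tmp), True)

save_single_csv(df, spark, 'mydata.csv')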
Mariusz