
I am trying to save my DataFrame to S3 like below:

myDF.write.format("com.databricks.spark.csv").options(codec="org.apache.hadoop.io.compress.GzipCodec").save("s3n://myPath/myData.csv")

Then I got errors:

<console>:132: error: overloaded method value options with alternatives:
  (options: java.util.Map[String,String])org.apache.spark.sql.DataFrameWriter <and>
  (options: scala.collection.Map[String,String])org.apache.spark.sql.DataFrameWriter
 cannot be applied to (codec: String)

Does anyone know what I missed? Thanks!

Edamame

1 Answer


Scala is not Python. It doesn't have **kwargs. You have to provide a Map:

myDF.write.format("com.databricks.spark.csv")
  .options(Map("codec" -> "org.apache.hadoop.io.compress.GzipCodec"))
  .save("s3n://myPath/myData.csv")
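When only one option is set, `DataFrameWriter.option` (singular) also works and takes a plain key/value pair, so no `Map` is needed — a sketch assuming the same spark-csv package:

```scala
// Equivalent single-option form: option(key, value) accepts the pair
// directly, avoiding the Map required by the plural options(...).
myDF.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("s3n://myPath/myData.csv")
```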
5ba86145
  • Instead of saving to one myData.csv file, I actually got a myData.csv "folder", where multiple csv.gz files are stored under the folder. Is there a way just to save it to a csv file. Thanks! – Edamame May 24 '16 at 03:58
  • @Edamame You cannot have a single file [without coalescing to a single partition](http://stackoverflow.com/a/31675351/1560062) and this is basically useless unless the size of the output is negligible. – zero323 May 24 '16 at 04:06
  • @zero323: Thanks! Assuming I coalesce to a single partition, how do I save it to one csv file? Thanks! – Edamame May 24 '16 at 04:16
  • use repartition as mentioned in zero323's comment – Rahul Jan 21 '21 at 15:10
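Following the comments above, a minimal sketch of coalescing before writing: Spark still produces a directory at the target path, but with a single part file inside it, which would then need to be renamed or moved (e.g. via the Hadoop FileSystem API) if a bare `myData.csv` file is required:

```scala
// Sketch: coalesce to one partition so the output directory contains
// a single part file. Note Spark writes a directory, not a bare file;
// the part-00000* file inside must be moved/renamed afterwards.
myDF.coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("s3n://myPath/myData.csv")
```

Note that `coalesce(1)` funnels all data through a single task, so it is only practical when the output is small.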