
I'm trying to save a DataFrame to CSV using the new Spark 2.1 CSV option:

    df.select(myColumns: _*).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .csv(absolutePath)

Everything works fine, and I don't mind having the part-000XX prefix, but now it seems that a UUID has been added as a suffix,

i.e.
part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz ==> part-00032.csv.gz

Does anyone know how I can remove this UUID suffix and keep only the part-000XX naming convention?

Thanks

Avi P
  • Check this http://stackoverflow.com/questions/41990086/specifying-the-filename-when-saving-a-dataframe-as-a-csv – Dhanesh Mar 18 '17 at 15:53
  • thanks @Dhanesh, but I'm using S3, so renaming after the file is persisted is not that simple (consider that files can be > 5GB). The part-000XX prefix is fine; I just don't like the new UUID attached as a suffix – Avi P Mar 18 '17 at 18:50
  • I see just two options: either move the S3 file to a new one with the name you desire, or save to local FS or HDFS, rename it, and move it to S3 (a sketch of the rename step follows below). http://stackoverflow.com/questions/21184720/how-to-rename-files-and-folder-in-amazon-s3 – Dhanesh Mar 19 '17 at 07:12
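
For illustration, here is a minimal sketch of the rename-after-write workaround suggested in the last comment, using the Hadoop FileSystem API. The regex, variable names, and the assumption of an active "spark" session are mine, not from the thread; note also that on S3 a "rename" is really a copy plus delete, which is exactly the cost flagged above for multi-GB files:

    import org.apache.hadoop.fs.Path

    // Strip the UUID segment from the part files that df.write.csv(absolutePath)
    // produced. Assumes `spark` is an active SparkSession.
    val outputDir = new Path(absolutePath)
    val fs = outputDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Matches e.g. "part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz"
    val partFile = """part-(\d{5})-.*(\.csv\.gz)""".r

    fs.listStatus(outputDir).foreach { status =>
      status.getPath.getName match {
        case partFile(split, ext) =>
          fs.rename(status.getPath, new Path(outputDir, s"part-$split$ext"))
        case _ => // leave _SUCCESS and other files untouched
      }
    }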

1 Answer


You can remove the UUID by overriding the configuration option "spark.sql.sources.writeJobUUID":

https://github.com/apache/spark/commit/0818fdec3733ec5c0a9caa48a9c0f2cd25f84d13#diff-c69b9e667e93b7e4693812cc72abb65fR75

Unfortunately this solution will not fully mirror the old saveAsTextFile style (i.e. part-00000), but it can make the output file name saner, such as part-00000-output.csv.gz, where "output" is the value you pass as spark.sql.sources.writeJobUUID. The "-" is appended automatically.
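
As a sketch of what that might look like (assuming the option can be set on the session configuration; depending on the Spark version it may need to go through the Hadoop configuration instead, and, as the comments below note, recent versions ignore it entirely):

    // Sketch only: override the write-job UUID so "output" replaces the random
    // UUID in the generated file names. Whether this key is still honored
    // depends on the Spark version (see the comments below).
    spark.conf.set("spark.sql.sources.writeJobUUID", "output")

    df.select(myColumns: _*).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .csv(absolutePath)
    // Expected result: part-00000-output.csv.gz, part-00001-output.csv.gz, ...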

SPARK-8406 is the relevant Spark issue, and here is the actual pull request: https://github.com/apache/spark/pull/6864

Garren S
  • Apparently this is no longer an option with recent versions of Spark; the name is created internally as f"part-$split%05d-$jobId$ext" (see https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala, paraphrased in the snippet after this thread), and the jobId comes from Hadoop's "mapreduce.job.id", which had better not be tampered with – MxR Mar 16 '18 at 02:28
  • Hmm, are you aware of new ways to accomplish a similar objective in newer versions? – Garren S Mar 16 '18 at 02:34
  • Alas, I wasn't able to find any way besides patching Spark itself, and given the maintenance cost and the plain "wrongness" of such an act, we decided to rewrite the parts of our system that relied on it. Another option would be a pull request to bring this configuration option back, but the chances aren't looking good, since it has already been refactored away once. Also, we don't actually have any Spark committers on our team to help us fight this uphill battle ^_^ – MxR Mar 16 '18 at 05:02
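
For reference, a rough paraphrase of the naming scheme MxR describes above (not the exact Spark source, just the quoted format string turned into a runnable snippet):

    // How Spark 2.3's HadoopMapReduceCommitProtocol builds part-file names,
    // paraphrased from the format string quoted in the comment above.
    def partFileName(split: Int, jobId: String, ext: String): String =
      f"part-$split%05d-$jobId$ext"

    partFileName(32, "10309cf5-a373-4233-8b28-9e10ed279d2b", ".csv.gz")
    // => "part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz"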