
I want to append the table metrics in an existing CSV file. I use the code below:

    metrics.coalesce(1)
    .write
    .option("header", "true")
    .option("sep",",")
    .mode("Append")
    .csv("data/outputs.csv")

Every time the block of code above runs, a new file `part-00000-{xxxxxx-xxxxx......}.csv` is created inside `data/outputs.csv` (where `outputs.csv` is a folder rather than a CSV file).

Is there any way to always append to the same CSV file instead of creating a new .csv file each time? And is there any way to define the final name of this CSV, instead of getting the `part-00000-{xxxxxx-xxxxx......}.csv` format?

I also tried the save-mode append option (`.mode(SaveMode.Append)`), with the same duplication result.

Rodrigo_V
  • There is no direct way of doing this because of Spark's distributed nature, but you can do it after you save with Spark. See this link for reference: https://stackoverflow.com/questions/40792434/spark-dataframe-save-in-single-file-on-hdfs-location . I think you should not do that unless you have a use case with no workaround. – Nikunj Kakadiya Apr 10 '21 at 06:29

1 Answer


You have to union them explicitly, and then use Overwrite rather than Append. Something like this:

    // Read the existing output, union the new rows, and rewrite the whole folder.
    spark
      .read
      .option(...)
      .csv("data/outputs.csv")
      .union(metrics)
      .coalesce(1)
      .write
      .option(...)
      .mode("Overwrite")
      .csv("data/outputs.csv")
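For the naming part of the question: Spark always writes a `part-XXXXX` file inside the target folder, so the usual workaround is to promote that single part file (after `coalesce(1)`) to a fixed name in a post-processing step outside Spark. A minimal sketch in plain Python, run after the Spark job finishes — `promote_part_file` is a hypothetical helper, not part of any Spark API:

```python
import glob
import os
import shutil

def promote_part_file(output_dir: str, final_path: str) -> None:
    """Move the single part file Spark wrote in output_dir to final_path,
    then delete the now-redundant Spark output folder."""
    part_files = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(part_files) != 1:
        # coalesce(1) should have produced exactly one part file
        raise RuntimeError(f"expected one part file, found {len(part_files)}")
    shutil.move(part_files[0], final_path)
    shutil.rmtree(output_dir)

# Example: after writing to "data/outputs.csv" (the folder), produce
# a real file named "data/outputs_final.csv".
# promote_part_file("data/outputs.csv", "data/outputs_final.csv")
```

This only works for local or locally-mounted storage; on HDFS you would do the equivalent rename with the Hadoop FileSystem API instead.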
Dima