
I am currently working on storing a Spark DataFrame as a .csv file in Azure Blob Storage. I am using the following code:

 smtRef2_DF.dropDuplicates().coalesce(1).write
  .mode("overwrite")
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save(csvBlobStorageMount + "/Output/Smt/SmtRef.csv")

This works, but it creates a SmtRef.csv folder in which the actual .csv file is stored as part-00000-tid.csv. How do I specify the name of the actual .csv file?

Thanks in advance.

Connor Blair
  • I don't think this question should be closed - saving as a single file is not the same as renaming a file. Here is an option for renaming with PyArrow and pathlib:

      import pathlib
      from pathlib import Path

      import pyarrow

      def rename_file_hdfs(hdfs_path):
          phc = pyarrow.hdfs.connect()
          fl = phc.ls(hdfs_path)
          # keep only the part-* files Spark wrote
          fl = [f for f in fl if pathlib.Path(f).stem.startswith("part")]
          for i, f in enumerate(fl):
              pa = Path(f).parent
              nf = f"newf{i}.csv"
              tp = Path(pa, nf)
              # pathlib collapses the double slash in the hdfs:// scheme
              tp = str(tp).replace("hdfs:/", "hdfs://")
              phc.mv(f, tp)

    – skibee Apr 05 '20 at 07:49

2 Answers


If the data is small enough to fit into memory, one workaround is to convert the Spark DataFrame to a pandas DataFrame and save it as a CSV from there:

# Collect the Spark DataFrame onto the driver as a pandas DataFrame
df_pd = df.toPandas()
# pandas writes a single file with exactly the name you give it
df_pd.to_csv("path")
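
Note that pandas writes through the driver's local filesystem, so the path must be one the driver can see; on Databricks, for instance, a blob storage mount is usually addressed as /dbfs/... rather than dbfs:/... (this assumes a Databricks-style mount like the one in the question).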
Keshinko

It's not possible with the Spark API alone.

If you want to achieve this, use .repartition(1), which will generate a single part file, and then use the Hadoop FileSystem API to rename that file in HDFS:

 import org.apache.hadoop.fs._

 FileSystem.get(spark.sparkContext.hadoopConfiguration)
   .rename(new Path("oldpathtillpartfile"), new Path("newpath"))
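
Putting the two steps together, here is a minimal end-to-end sketch. It assumes Spark 2.x (where the built-in .csv() writer is available); the _tmp directory name and the part-* glob pattern are illustrative:

 import org.apache.hadoop.fs._

 // Write the single part file into a temporary directory first (name is illustrative)
 val tmpDir = csvBlobStorageMount + "/Output/Smt/_tmp"
 smtRef2_DF.dropDuplicates()
   .repartition(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv(tmpDir)

 // Locate the one part-* file Spark produced and rename it to the desired name
 val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
 val partFile = fs.globStatus(new Path(tmpDir + "/part-*"))(0).getPath
 fs.rename(partFile, new Path(csvBlobStorageMount + "/Output/Smt/SmtRef.csv"))
 fs.delete(new Path(tmpDir), true)

Note that FileSystem.rename returns false rather than throwing if the destination already exists, so you may want to delete any existing target file first.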

Chandan Ray