
I am currently working on storing a Spark DataFrame as a .csv file in Azure Blob Storage. I am using the following code:

 smtRef2_DF.dropDuplicates().coalesce(1).write
  .mode("overwrite")
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save(csvBlobStorageMount + "/Output/Smt/SmtRef.csv")

This works, but it creates a SmtRef.csv folder in which the actual .csv file is stored as part-00000-tid.csv. How do I specify the name of the actual .csv file?

Thanks in advance.

Connor Blair
  • I don't think this question should be closed - saving as a single file is not the same as renaming a file. Here is an option for renaming with PyArrow and pathlib:

      import pathlib
      from pathlib import Path

      import pyarrow

      def rename_file_hdfs(hdfs_path):
          phc = pyarrow.hdfs.connect()
          fl = phc.ls(hdfs_path)
          # keep only the part-* files Spark wrote
          fl = [f for f in fl if pathlib.Path(f).stem.startswith("part")]
          for i, f in enumerate(fl):
              pa = Path(f).parent
              nf = f"newf{i}.csv"
              tp = Path(pa, nf)
              # pathlib collapses the double slash in the hdfs:// scheme
              tp = str(tp).replace("hdfs:/", "hdfs://")
              phc.mv(f, tp)

    – skibee Apr 05 '20 at 07:49

2 Answers


If the data is small enough to fit into memory, one workaround is to convert the Spark DataFrame to a pandas DataFrame and save it as a CSV from there:

# Collect the Spark DataFrame onto the driver as a pandas DataFrame
df_pd = df.toPandas()
# pandas writes a single file with exactly the name you give it
df_pd.to_csv("path")
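
Note that pandas writes through the driver's local filesystem, so the path must be one the driver can see; on Databricks, for instance, a blob storage mount is usually addressed as /dbfs/... rather than dbfs:/... (this assumes a Databricks-style mount like the one in the question).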
Keshinko

It's not possible with the Spark API alone.

If you want to achieve this, use .repartition(1), which will generate a single part file, and then use the Hadoop FileSystem API to rename that file in HDFS:

 import org.apache.hadoop.fs._

 FileSystem.get(spark.sparkContext.hadoopConfiguration)
   .rename(new Path("oldpathtillpartfile"), new Path("newpath"))
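
Putting the two steps together, here is a minimal end-to-end sketch. It assumes Spark 2.x (where the built-in .csv() writer is available); the _tmp directory name and the part-* glob pattern are illustrative:

 import org.apache.hadoop.fs._

 // Write the single part file into a temporary directory first (name is illustrative)
 val tmpDir = csvBlobStorageMount + "/Output/Smt/_tmp"
 smtRef2_DF.dropDuplicates()
   .repartition(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv(tmpDir)

 // Locate the one part-* file Spark produced and rename it to the desired name
 val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
 val partFile = fs.globStatus(new Path(tmpDir + "/part-*"))(0).getPath
 fs.rename(partFile, new Path(csvBlobStorageMount + "/Output/Smt/SmtRef.csv"))
 fs.delete(new Path(tmpDir), true)

Note that FileSystem.rename returns false rather than throwing if the destination already exists, so you may want to delete any existing target file first.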

Chandan Ray