
I tried to merge two files in a Data Lake using Scala in Databricks and save the result back to the Data Lake using the following code:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("adl://xxxxxxxx/Test/CSV")

df.coalesce(1).write
  .format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "true")
  .save("adl://xxxxxxxx/Test/CSV/final_data.csv")

However, final_data.csv is saved as a directory containing multiple files rather than as a single file, and the actual .csv output inside it is named 'part-00000-tid-dddddddddd-xxxxxxxxxx.csv'.

How do I rename this file so that I can move it to another directory?


2 Answers


Got it. The file can be renamed and moved to another destination with the following code. Note that the source files that were merged are also deleted.

val x = "Source"
val y = "Destination"

// Read all CSV files from the source directory
val df = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(x + "/")

// Write as a single part file inside the final_data.csv directory
df.repartition(1).write
  .format("csv")
  .mode("overwrite")
  .option("header", "true")
  .save(y + "/" + "final_data.csv")

// Delete the original source CSV files that were merged
dbutils.fs.ls(x).filter(file => file.name.endsWith("csv")).foreach(f => dbutils.fs.rm(f.path, true))

// Move (i.e. rename) the part file out of the directory, then remove the directory
dbutils.fs.mv(dbutils.fs.ls(y + "/" + "final_data.csv").filter(file => file.name.startsWith("part-00000"))(0).path, y + "/" + "data.csv")
dbutils.fs.rm(y + "/" + "final_data.csv", true)
  • Just got started with Databricks; could you please tell me where the rename of the part-00000 file is happening? I was able to move files to different folders, but not able to rename them with dbutils. – sab Jan 23 '20 at 13:24
  • dbutils.fs.mv has the effect of renaming a file, although it actually copies and then deletes the old file. As far as I know there is no real rename function for Databricks. – Trionet Aug 26 '20 at 07:51
  • I am getting this error: "NameError: name 'dbutils' is not defined Traceback (most recent call last): NameError: name 'dbutils' is not defined" – Muhammad Waheed Feb 09 '22 at 07:56
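As the comment above notes, dbutils.fs.mv effectively renames by moving. Outside Databricks, the same move-as-rename step (promote the single part file out of Spark's output directory, then delete the emptied directory) can be sketched with plain java.nio; the object and method names here are hypothetical, not part of any Databricks API:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object RenamePartFile {
  // Moves the single "part-" file out of a Spark-style output directory,
  // giving it a friendly name, then deletes the emptied directory
  // (mirroring dbutils.fs.mv followed by dbutils.fs.rm).
  def promote(outDir: Path, finalName: String): Path = {
    val part = Files.list(outDir)
      .filter(p => p.getFileName.toString.startsWith("part-"))
      .findFirst.get
    val target = outDir.getParent.resolve(finalName)
    Files.move(part, target, StandardCopyOption.REPLACE_EXISTING) // move == rename
    Files.delete(outDir) // directory is empty now, so plain delete suffices
    target
  }
}
```

This is only a local analogue for illustration; on an ADLS mount you would still use dbutils.fs.mv as in the answer above.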

This works for me:

Python

# x (the source directory) must be set before this cell runs
y = "dbfs:/mnt/myFirstMountPoint/apltperf/Shiv/Destination"
df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load(x + "/")
df.repartition(1).write.format("csv").mode("overwrite").save(y + "/" + "final_data.csv")
# Pass the paths to the Scala cell below via the Spark conf
spark.conf.set('x', str(x))
spark.conf.set('y', str(y))

Scala

var x = spark.conf.get("x")
var y = spark.conf.get("y")
dbutils.fs.ls(x).filter(file => file.name.endsWith("csv")).foreach(f => dbutils.fs.rm(f.path, true))
dbutils.fs.mv(dbutils.fs.ls(y + "/" + "final_data.csv").filter(file => file.name.startsWith("part-00000"))(0).path, y + "/" + "data.csv")
dbutils.fs.rm(y + "/" + "final_data.csv", true)