
A very clever person on Stack Overflow helped me copy files to a directory from Databricks here: copyfiles

I am using the same principle to remove the files once they have been copied, as shown in the link:

for i in range (0, len(files)):
  file = files[i].name
  if now in file:  
    dbutils.fs.rm(files[i].path,'/mnt/adls2/demo/target/' + file)
    print ('copied     ' + file)
  else:
    print ('not copied ' + file)

However, I'm getting the error:

TypeError: '/mnt/adls2/demo/target/' has the wrong type - class bool is expected.

Can someone let me know how to fix this? I thought it would be a simple matter of removing the file after originally copying it, using the command dbutils.fs.rm.

Carltonp
  • ok, the above example didn't reflect the script we have in production, which is: `for i in range (0, len(files)): file = files[i].name if now in file: dbutils.fs.rm(files[i].path,'adl://xxxxxxxxxxxx.azuredatalakestore.net/Folder Structure/RAW/1stParty/LCMS/DE/stageone/') print ('removed ' + file) else: print ('not removed ' + file)` The problem was that I missed the opening brackets. So, the problem isn't **the wrong type class bool is expected** as stated above; the problem is an invalid syntax error at `print ('removed ' + file)`. I hope that helps to fix. – Carltonp Jan 08 '19 at 14:43

3 Answers


If you want to delete all files from the following path: '/mnt/adls2/demo/target/', there is a simple command:

dbutils.fs.rm('/mnt/adls2/demo/target/', True)

Anyway, if you want to use your code, take a look at the dbutils docs:

rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

The second argument of the function is expected to be a boolean, but your code passes a string with a path:

dbutils.fs.rm(files[i].path, '/mnt/adls2/demo/target/' + file)

So your new code can be the following (note that files[i].path returned by dbutils.fs.ls is already the file's full path, so nothing needs to be appended to it):

for i in range(0, len(files)):
    file = files[i].name
    if now in file:
        # files[i].path is already the full path to the file,
        # so it is passed straight to rm; True is the recurse flag
        dbutils.fs.rm(files[i].path, True)
        print('removed     ' + file)
    else:
        print('not removed ' + file)
Fabio Schultz
  • wow @Fabio, I will test this out in the morning. If this works, I won't understand how the Databricks experts (whom I have a support contract with) couldn't figure this out. Thanks in advance. I will let you know how I get on with it. Cheers – Carltonp Jan 08 '19 at 23:12
  • the command you suggested deletes the folder as well as the files: `dbutils.fs.rm('/mnt/adls2/demo/target/', True)`. I just need the files deleted – Carltonp Jan 09 '19 at 11:41
  • The actual command in our production is `dbutils.fs.rm('adl://devszendsadlsrdpacqncd.azuredatalakestore.net/Folder Structure/RAW/1stParty/LCMS/DE/stageone', True)` – Carltonp Jan 09 '19 at 11:43
  • Hello @Carltonp, sorry I'm late to answer you. So I understand you can't delete just the files. I have a suggestion: you can use `dbutils.fs.rm('/mnt/adls2/demo/target/', True)` and after that you can create the folder again with `dbutils.fs.mkdirs('/mnt/adls2/demo/target/', True)` ... If that does not work for you, you can list all the files and delete them one by one like you tried before (a sketch of both options follows these comments) – Fabio Schultz Jan 09 '19 at 16:37
  • 'use dbutils.fs.rm('/mnt/adls2/demo/target/', True) so after that you can create a folder again' that's exactly what I did. Thank you soooo much. Also, just so you know both of your suggestions worked. I hope you don't mind, but I have shared your solution with Databricks. Thanks man – Carltonp Jan 09 '19 at 18:41
  • @Carltonp I'm glad to know that :) – Fabio Schultz Jan 09 '19 at 19:40
  • The problem with your solution (delete and create) is that it removes the permission settings on the data lake folders too. So the customer cannot access the data when a special permission was set on the folder. – gszecsenyi Mar 17 '20 at 17:02
  • @FabioSchultz The mkdir command above should be `dbutils.fs.mkdirs('/mnt/adls2/demo/target/')` it only takes one argument. – Aaron Robeson Sep 16 '20 at 14:14
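
Pulling the comment thread together, here is a minimal Python sketch of both options (the path is the example mount from the question; the recreate step uses the single-argument mkdirs per the correction above):

# Delete-and-recreate, per the comment thread. Caveat from the comments:
# this drops any permission settings (ACLs) that were set on the folder.
dbutils.fs.rm('/mnt/adls2/demo/target/', True)
dbutils.fs.mkdirs('/mnt/adls2/demo/target/')

# Alternative that keeps the folder (and its permissions) in place:
# remove only the contents, one entry at a time.
for f in dbutils.fs.ls('/mnt/adls2/demo/target/'):
    dbutils.fs.rm(f.path, True)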

In order to remove files from DBFS, you can run this in any notebook:

%fs rm -r dbfs:/user/sample_data.parquet
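
For reference, the %fs magic maps to dbutils.fs, so the equivalent call in a Python cell would be:

# Same as `%fs rm -r dbfs:/user/sample_data.parquet`
dbutils.fs.rm('dbfs:/user/sample_data.parquet', True)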
Shikha

If you have a huge number of files, deleting them this way might take a lot of time. You can use Spark parallelism to delete the files in parallel. The answer I am providing is in Scala, but it can be changed to Python.

You can check whether the directory exists using the function below:

def CheckPathExists(path: String): Boolean = {
  try {
    dbutils.fs.ls(path)
    true
  } catch {
    case _: java.io.FileNotFoundException => false
  }
}

You can define a function to delete the files. Create this function inside an object and make that object extend Serializable, as below:

object Helper extends Serializable {
  // Converting the file list to a DataFrame and calling foreach runs the
  // deletes as Spark tasks, so the files are removed in parallel.
  def delete(directory: String): Unit = {
    dbutils.fs.ls(directory).map(_.path).toDF("path").foreach { row =>
      val filePath = row.getString(0)
      println(s"deleting file: $filePath")
      dbutils.fs.rm(filePath, true)
    }
  }
}

Now you can first check whether the path exists, and if it does, call the delete function to remove the files within the folder across multiple tasks.

val directoryPath = "<location>"
val directoryExists = CheckPathExists(directoryPath)
if (directoryExists) {
  Helper.delete(directoryPath)
}
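
Since the answer notes the Scala code can be changed to Python, here is a rough adaptation. Note that dbutils is not available inside Python worker processes, so this sketch parallelizes with a driver-side thread pool rather than Spark tasks; the pool size and path are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

def path_exists(path):
    # dbutils.fs.ls raises an exception when the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

def delete_files(directory):
    files = dbutils.fs.ls(directory)
    # 8 workers is an arbitrary choice; tune for your workload
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(lambda f: dbutils.fs.rm(f.path, True), files))

directory_path = '/mnt/adls2/demo/target/'  # example path from the question
if path_exists(directory_path):
    delete_files(directory_path)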
Nikunj Kakadiya