
A member of this group helped me copy files to a folder based on date:

copy based on date

I would like to tweak the code to copy files based on certain characters in the filename. In the example that follows, the characters are 1111, 1112, 1113 and 1114. So, suppose we have the following four files:

File_Account_1111_exam1.csv
File_Account_1112_testxx.csv
File_Account_1113_pringle.csv
File_Account_1114_sam34.csv

I would like File_Account_1114_sam34.csv to be copied to the folder only if File_Account_1113_pringle.csv has already been copied there. Likewise, I would only want File_Account_1113_pringle.csv copied if File_Account_1112_testxx.csv has already been copied to the folder, and so on.

Therefore, if all the files have been copied to the folder, it would look something like the following:

dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1111_exam1.csv", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1112_testxx.csv", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1113_pringle.csv", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1114_sam34.csv", data, True)
Carltonp
  • It looks like you are trying to implement some business logic using a copy-file operation, which will be complex to test automatically and maintain. It is more straightforward, flexible, and testable to copy all available files and then apply the business logic at the data-pipeline level. – David Greenshtein Jan 11 '19 at 14:46
  • @DavidGreenshtein, yes I am trying to implement business logic. However, I'm struggling. – Carltonp Jan 11 '19 at 22:18
  • How do you read the files after the copy operation completes? – David Greenshtein Jan 12 '19 at 11:38
  • @DavidGreenshtein, I do a spark.read.csv. I appreciate the complexity of the question; however, I was hoping that if no one is able to provide an answer, then maybe someone could set me off in the right direction towards a solution? – Carltonp Jan 12 '19 at 12:22
  • I'm trying to get the ban lifted from Stack Overflow by improving this question, as I have a score of -1. But I can't see how I should edit this question to improve it and get the ban lifted. Can someone let me know why this is a bad question, and how to improve it? – Carltonp Sep 18 '19 at 13:11

1 Answer


Instead of applying any business logic when uploading files to DBFS, I would recommend uploading all available files and then reading them using test = sc.wholeTextFiles("pathtofile"), which returns a key/value RDD of the file name and the file content; here is a corresponding thread. Once that is done, any sorting or filtering business logic based on the file name can be implemented and tested in a Spark job.
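
For illustration, a minimal sketch of that approach, assuming the mount path and the 1111–1114 naming from your question (the consecutive-number check is just one way to express your ordering rule after the copy):

import re

# Read every uploaded file as a (path, content) pair
rdd = sc.wholeTextFiles("/mnt/adls2/demo/files/*.csv")

# Pull the 4-digit sequence number (1111, 1112, ...) out of each file name
def seq_number(path):
    match = re.search(r"_(\d{4})_", path.split("/")[-1])
    return int(match.group(1)) if match else None

numbers = sorted(n for n in rdd.keys().map(seq_number).collect() if n is not None)

# Your rule, applied after the copy: a file is only usable if its
# predecessor arrived, i.e. the sequence numbers must be consecutive
all_present_in_order = all(b - a == 1 for a, b in zip(numbers, numbers[1:]))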

I hope it is helpful.