3

I'm trying to write a script in databricks that will select a file based on certain characters in the name of the file or just on the datestamp in the file.

For example, the following file looks as follows:

LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

I have created the following code in Databricks:

import datetime
now1 = datetime.datetime.now()
now = now1.strftime("%Y-%m-%d")

Using the above code I tried to select the file using following:

LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv'% now

However, if you look closely you will notice that there is a space between the datestamp and the timestamp, i.e between 22 and 06

LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

It is because if this space that is preventing my code above from working.

I don't think Databricks supports wildcards so the following won't work:

LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv'% now

Someone once suggested TRUNCATING the timestamp.

Can someone let me know if:

A.TRUNCATING will solve this problem B.Is there a way to my code LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv'% now

To select the whole file? Bearing in mind I definitely need to select based on current date.. I just want to be able to use my code to select on the file.

Carltonp
  • 1,166
  • 5
  • 19
  • 39

1 Answers1

4

You can read filenames with dbutils and can check if a pattern matches in an if-statement: if now in filname. So instead of reading files with a specific pattern directly, you get a list of files and then copy the concrete files matching your required pattern.

The following code works in a databricks python notebook:

1. Writing three files to the filesystem:

data = """
{"a":1, "b":2, "c":3}
{"a":{, b:3} 
{"a":5, "b":6, "c":7}

"""

dbutils.fs.put("/mnt/adls2/demo/files/file1-2018-12-22 06-07-31.json", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file2-2018-02-03 06-07-31.json", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file3-2019-01-03 06-07-31.json", data, True)

2. Reading the filnames as a list:

files = dbutils.fs.ls("/mnt/adls2/demo/files/")

3. Getting the actual date:

import datetime

now = datetime.datetime.now().strftime("%Y-%m-%d")
print(now)

Output: 2019-01-03

4. Copy actual files:

for i in range (0, len(files)):
  file = files[i].name
  if now in file:  
    dbutils.fs.cp(files[i].path,'/mnt/adls2/demo/target/' + file)
    print ('copied     ' + file)
  else:
    print ('not copied ' + file)

Output:

not copied file1-2018-12-22 06-07-31.json

not copied file2-2018-02-03 06-07-31.json

copied file3-2019-01-03 06-07-31.json

Hauke Mallow
  • 2,887
  • 3
  • 11
  • 29
  • wow, if this works I'll need to inform Databricks engineers. I was on call with them for 90mins trying to figure this out. After 90mins they said they would need more time and get back to me. I will check this out in the next 30mins and get back to you. Wow! – Carltonp Jan 03 '19 at 19:53
  • I'm a little confused when you say 'Just check now in should do it.' – Carltonp Jan 03 '19 at 20:18
  • I managed to follow your example exactly and I got the same result you did - trust me that's a major feat for me. However, am I now supposed to add **now** here `LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv'% now` and get the same result as I did in the example here `LCMS_MRD_Delta_LoyaltyAccount_1992_2019-01-03 06-05-52.csv` – Carltonp Jan 03 '19 at 20:46
  • Could you please share the code, don't exactly what you mean. Couldn't you use the filename, so in my example "file" matching the pattern instead of constructing it? – Hauke Mallow Jan 03 '19 at 21:29
  • if you go to the following link, you'll be able to view the code. [link]https://drive.google.com/file/d/1NA8B1_pBH_vQB1mduxS7eao5BtFiY3GP/view?usp=sharing[link] – Carltonp Jan 04 '19 at 11:44
  • You may need to view the html with chrome or MS Edge – Carltonp Jan 04 '19 at 11:46
  • I updated my answer, sorry for the confusion: Do the copy in the loop with the filename (full path) you get with dbutils.ls in the step before, that works. Pls. delete the file on the google drive. Tip: Use a secret scope to store client.id etc.Hope this helps. – Hauke Mallow Jan 04 '19 at 12:29
  • this is absolutely brilliant. I have also just shown the code to DATABRICKS, and they also thank you. Thank you soooooo much. This works amazingly well. – Carltonp Jan 04 '19 at 13:37
  • I just wanted to give you another big shout-out for this amazing solution. This is echoed by the engineers at Databricks. – Carltonp Jan 05 '19 at 06:43
  • Thanks Carltonp, good to hear that this is solving your problem! – Hauke Mallow Jan 05 '19 at 06:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186215/discussion-between-carltonp-and-hauke-mallow). – Carltonp Jan 05 '19 at 07:09