
Hope everyone is doing well...

We are exploring whether it is possible to keep a few of our jars in a folder in the Workspace and have them copied around as part of the init scripts.

For example, in the workspace we have the following structure.

/Workspace/<Folder_Name1>/jars/sample_name_01.jar

The init script would then copy it to a path in DBFS and on the driver node's local file system.

#!/bin/bash
cp /Workspace/<Folder_Name1>/jars/sample_name_01.jar /dbfs/jars/
cp /Workspace/<Folder_Name1>/jars/sample_name_01.jar /tmp/jars/

Of course, the init script fails with the following error message:

cp: cannot stat '/Workspace/<Folder_Name1>/jars/sample_name_01.jar': No such file or directory

I have tried the path both with and without the /Workspace prefix. I have also tried accessing the file from the web terminal, and there I am able to see the files.

  1. Are workspace files accessible from an init script?
  2. Is there a limitation specific to jar and whl/egg files?
  3. What is the right syntax to access them?
  4. Does it make sense to keep the jars (only a few) as workspace files, or should they live in DBFS?

Thanks for all the help... Cheers...

Update 01:

Tried some of the suggestions received through other channels...

  1. Considering that init scripts stored in the Workspace are referenced without the /Workspace prefix, I have also tried the jar path without it, but it is still the same issue.
  2. Have also tried listing the directory and printing its contents from within the init script; the path itself does not seem to be recognized (a rough sketch of the kind of check is below).
  3. Have also tried sleeping for up to 2 minutes to give the mounts some time, still nothing...
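
For reference, the checks were along these lines (a rough sketch only; the log location under /dbfs/tmp is just illustrative):

#!/bin/bash
# Rough diagnostic sketch: record what the node can actually see at init time.
# The log location under /dbfs/tmp is illustrative.
mkdir -p /dbfs/tmp
{
  echo "=== init diagnostics $(date) ==="
  ls -la / | grep -i workspace || echo "no Workspace entry at /"
  ls -la /Workspace 2>&1 || true
  ls -la "/Workspace/<Folder_Name1>/jars" 2>&1 || true
} >> /dbfs/tmp/init_diag.log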
rainingdistros
3 Answers


As per a related post on the Databricks community forums, it has been confirmed that, for now, this is not possible. When an init script is placed in the workspace, access is limited to that init script alone and not to any other files in the workspace. The post also mentions that accessing the files is still possible through API calls or through the Databricks CLI, but personally I feel that makes it a slightly roundabout way of doing it. Thank you for all the help; I hope for and look forward to better ways of doing it.
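
For completeness, a rough sketch of what the API-based workaround could look like inside the init script, assuming the cluster is given a workspace URL and token (e.g. as environment variables backed by a secret scope); the exact endpoint and parameters should be verified against the Workspace API documentation:

#!/bin/bash
# Sketch only: fetch the jar from the workspace via the REST API during init.
# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are provided to the cluster,
# e.g. as environment variables backed by a secret scope.
mkdir -p /tmp/jars
curl -sf \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --get "${DATABRICKS_HOST}/api/2.0/workspace/export" \
  --data-urlencode "path=/<Folder_Name1>/jars/sample_name_01.jar" \
  --data-urlencode "direct_download=true" \
  -o /tmp/jars/sample_name_01.jar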

rainingdistros
  • I think this may be the updated link: https://community.databricks.com/t5/data-engineering/accessing-workspace-files-within-cluster-init-script/m-p/3183 – FRG96 Jul 07 '23 at 07:22
  • Thank you for sharing the updated link - have updated answer. – rainingdistros Jul 10 '23 at 06:29

First, check that you have permissions on the workspace and jar folders. If you do and cp is still not working, below are the possible reasons.

When admins upload jar files, there are two options.

  1. Upload the jars as a library.
  2. Upload the jars as plain files.

Option 1: uploading as a library

In the workspace UI you create a library, which prompts you to upload the jar file. After clicking Create, the resulting page gives you the option to install the library on a cluster and shows its Source path, which is the part you need.

When uploading as a library, the jars land in DBFS by default, in the location below.

/dbfs/FileStore/jars/
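
Since DBFS is mounted on the cluster nodes, an init script can copy straight from that location; a minimal sketch (the jar file name below is illustrative, as library uploads usually get a generated name):

#!/bin/bash
# Minimal sketch: jars uploaded as a library already live in DBFS,
# which is mounted on the nodes, so a plain cp works here.
mkdir -p /tmp/jars
cp /dbfs/FileStore/jars/<uploaded_jar_name>.jar /tmp/jars/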

Option 2: uploading as a plain file

When the jar is uploaded as just a file, the UI prompts for the file upload; after you click Create, the jars show up as files in the workspace folder.

If you run your copy command on a jar that was uploaded as a file, it should work.

If you still get the same error, then it is a permission issue. A possible workaround is to run the code below in a notebook after the cluster has been created.

%sh
cp '/Workspace/Users/xxxxxxx/jars/helloworld-2.0_tmp (1).jar' /dbfs/jars/
ls /dbfs/jars/


Note - This does not work if admins upload the jar as a library. As I mentioned above, in that case the jars are available only in DBFS.

JayashankarGS
  • Thank you so much for your detailed answer... I can confirm that I have the necessary permissions on the Workspace folder - I seem to have Manage privileges on that path (just for testing purposes). I can also confirm that the jar is copied as a file and not as a library. When I run a similar command to copy the file from a notebook, it still gives "file not found". One observation: I guess you are uploading to a path within your Repos section? Can you maybe upload the jar to a new folder at the same level as Shared and Users and see if it works? – rainingdistros Jun 13 '23 at 15:11
  • I uploaded the file to jars folder , which is under `Users`. – JayashankarGS Jun 14 '23 at 07:16
  • The ask is accessing from the inside of the init script - your example is for copying files using the notebooks... – Alex Ott Jun 17 '23 at 15:08
  • Yeah @AlexOtt. I mentioned that if the file is uploaded directly, not as a library, the copy command should work; if that doesn't work, then I suggested using notebooks. – JayashankarGS Jun 18 '23 at 03:56

Just copying the answer I found useful from this Databricks Community Forum thread: https://community.databricks.com/t5/data-engineering/accessing-workspace-files-within-cluster-init-script/m-p/3183

The init script runs on the cluster nodes before the notebook execution, and it does not have direct access to workspace files.

The documentation you mentioned refers to placing the init script inside a workspace file, which means you can store the script itself in a file within the Databricks workspace. However, it doesn't grant direct access to other workspace files from within the init script.

To access a workspace file within the init script, you can consider using the Databricks CLI or Databricks API to retrieve the file and then copy or read it on the cluster nodes during the init script execution.
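
As a rough sketch only, with the (legacy) Databricks CLI installed on the node and authentication provided via DATABRICKS_HOST and DATABRICKS_TOKEN, that could look something like the below; the exact command syntax depends on the CLI version you use:

#!/bin/bash
# Sketch only: retrieve the workspace file with the Databricks CLI during init.
# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set for the cluster.
pip install --quiet databricks-cli
mkdir -p /tmp/jars
databricks workspace export /<Folder_Name1>/jars/sample_name_01.jar /tmp/jars/sample_name_01.jar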

FRG96