
OK, this is very strange. I have some init scripts that I would like to run when a cluster starts.

The cluster has the init script, which is in a file (in DBFS), basically this:

dbfs:/databricks/init-scripts/custom-cert.sh

Now, when I create the init script like this, it works (no SSL errors for my endpoints). Also, the event log for the cluster shows the duration as 1 second for the init script:

dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")

However, if I just put the init script in a bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes, as per the event log, but the execution duration is 0 seconds.

I have the sh script in a file named

custom-cert.sh

with the same contents as above, i.e.

#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"

but when I check /usr/local/share/ca-certificates/, it does not contain orgcerts.crt (copied from /dbfs/orgcertificates/), even though the cluster init script has run.

Also, I have compared the contents of the init script in both cases and, at least to the naked eye, I can't see any difference,

i.e.

%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh

shows the same contents in both scenarios. What is the problem with the second case?

EDIT: I read a bit more about init scripts and found that the logs of init scripts are written here

%sh
ls /databricks/init_scripts/
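
For reference, a hedged sketch of how to surface those log files (the per-run file names, e.g. *stderr*, are an assumption and may vary by DBR version):

%sh
# print the stderr of each init script run found under the log directory
find /databricks/init_scripts -name '*stderr*' -print -exec tail -n 20 {} \;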

Looking at the err file in that location, it seems there is an error

sudo: update-ca-certificates
: command not found

Why is update-ca-certificates found in the first case but not when I put the same script in a .sh file and upload it to DBFS (instead of executing the dbutils.fs.put within a notebook)?
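
A quick way to check whether invisible carriage returns are the culprit (the stray line break in the error above is suggestive) is to dump non-printing characters; this is a minimal diagnostic sketch, assuming the script is still at the same DBFS path:

%sh
# cat -A marks line ends with "$" and shows carriage returns as "^M";
# a CRLF-encoded script will display "^M$" at the end of every line
cat -A /dbfs/databricks/init-scripts/custom-cert.sh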

EDIT 2 (in response to the first answer): After running the command

dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")

the output is the file custom-cert.sh. I then restart the cluster with the init script location set to dbfs:/databricks/init-scripts/custom-cert.sh, and it works. So the init script is reading essentially the same content (the generated .sh file). Why can't it read the file if I do not use dbutils.fs.put but instead put the contents in a bash file and upload it during the CI/CD process?

Saugat Mukherjee
  • which DBR version? I just checked on 8.1, update-ca-certificates is inside PATH – Alex Ott Sep 28 '21 at 06:56
  • @AlexOtt : 8.3. I also updated the question with EDIT 2, in response to the answer by Karthikeyan – Saugat Mukherjee Sep 28 '21 at 07:10
  • How do you upload the script to DBFS from CI/CD pipeline? – Alex Ott Sep 28 '21 at 07:17
  • @AlexOtt : Using the task "Databricks Files to DBFS" in Azure DevOps (by DataThirst); the file gets uploaded fine, with the same content that I mentioned above – Saugat Mukherjee Sep 28 '21 at 07:19
  • The question is how one is supposed to create this "sh script". Is it by running the command in a notebook in the CI/CD pipeline? This is what a similar guide talks about (it is quite poor, though, with no explanation): https://kb.databricks.com/python/import-custom-ca-cert.html. The init script of the cluster needs a DBFS file, and the only way it works is if I run the command in a notebook first and then start the cluster. This essentially means I have to run the command in a notebook during the CI/CD pipeline? – Saugat Mukherjee Sep 28 '21 at 07:30

2 Answers


As we are aware, an init script is a shell script that runs during startup of each cluster node, before the Apache Spark driver or worker JVM starts. In case 2, when you run the bash command via the %sh magic command, you are executing it only on the local driver node, so the worker nodes are not able to access the result. In case 1, by using dbutils.fs.put you write the file from the DBFS root, so the worker nodes can access the path along with the driver node.

Ref : https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram

  • Thanks for responding, but I am finding it hard to wrap my head around this. The thing is, after running the dbutils.fs.put command, I restart the cluster and even then it works. The output of the dbutils command is the same init .sh file that I am using in the cluster, and I am restarting the cluster. How is it then reading it after the restart? Adding this in EDIT 2 of the question. – Saugat Mukherjee Sep 28 '21 at 07:05

It seems that the observation I made in the comments section of my question is the way to go.

I now create the init script using a Databricks job that I run during the CI/CD pipeline from Azure DevOps.

The notebook has the commands

dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")

I then create a Databricks job (pointing to this notebook); the cluster is a job cluster, which is just temporary. Of course, in my case, even this job creation is automated using a PowerShell script.

I then call this Databricks job in the release pipeline, again using a PowerShell script.
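
As a rough illustration of that step (the author uses PowerShell; this is an equivalent sketch using curl against the Databricks Jobs run-now API, with placeholder workspace URL, token, and job ID):

#!/bin/bash
# Trigger the job that (re)creates the init script; all values below are placeholders.
DATABRICKS_HOST="https://<your-workspace>.azuredatabricks.net"
DATABRICKS_TOKEN="<personal-access-token>"
JOB_ID=123   # ID of the job that runs the notebook shown above

curl -s -X POST "${DATABRICKS_HOST}/api/2.1/jobs/run-now" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"job_id\": ${JOB_ID}}"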

This creates the file

/databricks/init-scripts/custom-cert.sh

I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).

I do not know (or still understand) why the same script file cannot simply be part of a repo and uploaded during the release process (instead of this Databricks job calling a notebook). I would love to know the reason. The other answer on this question does not hold true, as you can see: the init script is created by a job cluster and then accessed from another cluster as part of its init script.

It simply boils down to how the init script gets created.

But I get my job done. Posting this in case it helps someone get their job done too. I have raised a support case, though, to understand the reason.

LATEST UPDATE: I found the answer earlier but never got around to updating it here. The problem was the EOL (end-of-line) characters. I was developing on Windows using VS Code, and the EOL was CRLF instead of LF. The problem can be solved by having a .gitattributes file inside your repo and specifying

*.sh eol=lf

This way you do not have to create the script using a Databricks job run. You can simply have the script in your repo, upload it to DBFS, and use it in your init scripts. It also saves time because you do not need to run a job to create the script.
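
As a quick sanity check before committing or uploading the script (a minimal sketch; assumes a Linux/macOS shell or Git Bash on Windows):

# "with CRLF line terminators" in the output means the script will break on the cluster
file custom-cert.sh

# strip any carriage returns in place (GNU sed), then re-check
sed -i 's/\r$//' custom-cert.sh
file custom-cert.sh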

Hope this helps someone.

Saugat Mukherjee