
I have a few JAR files/packages in DBFS, and I want an init script (so that I can attach it to the automated cluster) that installs the JAR packages every time the cluster starts.

I also want to install packages from Maven using an init script.

I can do all of this using the Databricks UI, but the requirement is to install the libraries using an init script.

Alex Ott
user1860447

1 Answer


To install JAR files, just put the files on DBFS in some location, and in the init script do:

cp /dbfs/<some-location>/*.jar /databricks/jars/
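
As a slightly more defensive variant of the one-liner above, here is a minimal sketch that wraps the copy in a function and fails loudly when no JARs are found. The name `copy_jars` is a hypothetical helper, not a Databricks command, and the `<some-location>` placeholder is left unfilled on purpose.

```shell
#!/bin/bash
# Sketch only: copy_jars is a hypothetical helper name.
# Copies every *.jar from a source directory into a destination directory
# and returns non-zero if nothing was found, so a missing or empty DBFS
# path does not pass silently.
copy_jars() {
  local src="$1" dest="$2"
  local jar found=0
  for jar in "$src"/*.jar; do
    [ -e "$jar" ] || continue        # the glob matched nothing
    cp "$jar" "$dest"/ || return 1
    found=1
  done
  if [ "$found" -eq 0 ]; then
    echo "no JAR files found in $src" >&2
    return 1
  fi
}

# In the actual init script the call would look like this
# (placeholder path intentionally left as in the answer):
#   copy_jars "/dbfs/<some-location>" /databricks/jars
```

The error message goes to stderr, which should make a failed copy easier to spot when you check whether the init script ran correctly.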

Installing Maven dependencies is trickier, because you also need to fetch their transitive dependencies. But it's doable. From the init script:

  • Download and unpack Maven
  • Execute:
mvn dependency:get -Dartifact=<maven_coordinates>
  • Move the downloaded JARs:
find ~/.m2/repository/ -name \*.jar -print0|xargs -0 mv -t /databricks/jars/
  • (Optional) Remove the directory that is no longer needed:
rm -rf ~/.m2/
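
The steps above could be tied together roughly as follows. This is a sketch under assumptions, not a tested Databricks init script: the function names (`install_maven`, `fetch_artifact`, `move_jars`) are hypothetical, the Maven tarball URL is left as a placeholder for you to supply, and it assumes `curl`, GNU `tar`, GNU `mv`, and a Java runtime are available on the node.

```shell
#!/bin/bash
# Sketch only: all function names below are hypothetical helpers.

# Download and unpack a Maven binary tarball, then put mvn on the PATH.
# The caller supplies the tarball URL; no specific URL is assumed here.
install_maven() {
  local url="$1" dir="$2"
  mkdir -p "$dir"
  curl -fsSL "$url" | tar -xz -C "$dir" --strip-components=1
  export PATH="$dir/bin:$PATH"
}

# Resolve an artifact and its transitive dependencies into ~/.m2.
fetch_artifact() {
  mvn -q dependency:get -Dartifact="$1"
}

# Move every JAR found under a local repository into the target directory.
# Using -exec ... + means mv is simply not run when no JAR is found.
move_jars() {
  local repo="$1" dest="$2"
  find "$repo" -name '*.jar' -exec mv -t "$dest" {} +
}

# Intended call order inside an init script (placeholders left unfilled):
#   install_maven "<maven-tarball-url>" /tmp/maven
#   fetch_artifact "<maven_coordinates>"
#   move_jars ~/.m2/repository /databricks/jars
#   rm -rf ~/.m2
```

Note that `move_jars` moves every JAR in the local repository, so you may still need to filter out JARs you do not want on the cluster.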

P.S. But really, I recommend automating this kind of thing via the Databricks Terraform Provider.

Alex Ott
  • I used wget to retrieve jars from mvn and placed them in /databricks/jars/ – user1860447 Mar 08 '21 at 19:01
  • wget won't handle transitive dependencies... Also, keep in mind that using `/dbfs/` may cause some problems, so check that your init scripts executed correctly – Alex Ott Mar 08 '21 at 19:16
  • @AlexOtt Do the jars need to have a specific name once they are in /databricks/jars/ on the driver (for example, on a single node)? – Arthur Clerc-Gherardi Nov 23 '21 at 10:01
  • I'm having trouble installing gresearch and using it in my Python notebooks – Arthur Clerc-Gherardi Nov 23 '21 at 10:02
  • no, they just need to have `.jar` extension – Alex Ott Nov 23 '21 at 10:17
  • That's what I was thinking. It's weird that nothing is detected for these jars even though they are on the driver (checked using the web terminal). I'll post something if I can't get it working :), thanks! – Arthur Clerc-Gherardi Nov 23 '21 at 10:19
  • I managed it for other jars, but I can't make it work for this library: https://github.com/G-Research/spark-extension. I tried your Maven solution (adding some execute permissions too), but it didn't work either. Do you have any idea? – Arthur Clerc-Gherardi Nov 23 '21 at 11:02
  • check that you have the version built for the corresponding Spark version – Alex Ott Nov 23 '21 at 18:20
  • I thought it was that at first too. I'm using spark-extension_2.12-2.0.0-3.2.jar, Scala 2.12 and Spark 3.2, on a cluster with DBR 10.1 (Spark 3.2 - Scala 2.12). Is there any command I could run on my cluster to check whether the jar can be used? Maybe it needs other libraries to run :/ – Arthur Clerc-Gherardi Nov 25 '21 at 13:43
  • just print the classpath - you can obtain it from the system properties – Alex Ott Nov 25 '21 at 14:12
  • I can find it in the classpath, so it should be known. Mmmh.... – Arthur Clerc-Gherardi Nov 25 '21 at 14:36
  • What is the command to execute some Spark code on a node? Obviously it doesn't know pyspark or sbt. Should I use something like python -m pyspark XXX? I'd like to point directly to my jar to see if I can get a more explicit error – Arthur Clerc-Gherardi Nov 25 '21 at 14:38
  • As a shortcut for the three steps `mvn dependency:get`, `find`, and `rm`, you could do `mvn dependency:copy -Dartifact= -DoutputDirectory=/databricks/jars/` – Tom Oct 05 '22 at 12:31
  • Really, it's not that simple, because we need to remove unnecessary jars, like Spark, Scala, etc. – Alex Ott Oct 05 '22 at 13:34
  • @AlexOtt You mention above *"wget won't handle transitive dependencies... Also, take into account that by using /dbfs/ you may have some problems, so you check if your init scripts executed correctly"* but don't specify an alternative in your answer here. What do you recommend if not wget? – iamdave Feb 03 '23 at 17:01
  • No alternatives really when using wget… you can use maven to fetch dependencies – Alex Ott Feb 03 '23 at 19:07
  • Is this `/databricks/jars` officially documented anywhere? – MMarshall Mar 21 '23 at 09:28