I am running an EMR notebook (platform: AWS, notebook: Jupyter, kernel: PySpark).
I need to install a .jar dependency (sparkdl) to process some images.
Using spark-submit, I can use:
spark-submit --packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11
Using a local notebook, I can use:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config('spark.jars.packages', 'databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11')
    .getOrCreate()
)
But how do I do the same thing in an EMR notebook?
- Either I could use a bootstrap action to install it on every node, but I don't know how to proceed… (see my sketch after this list)
- Or I could configure the SparkSession to use the dependency, but the notebook does not seem to be able to reach the repository… I also don't know the syntax for loading the copy of the file that I put on my S3 bucket… (again, a sketch below)
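For option 1, this is the kind of bootstrap script I have in mind (a rough sketch; the destination directory /usr/lib/spark/jars is only my assumption about where Spark looks for jars on EMR nodes):

#!/usr/bin/env python3
# Hypothetical bootstrap action: copy the sparkdl jar from my S3 bucket
# into the local Spark jars directory on every node at cluster startup.
import subprocess

JAR_S3_PATH = "s3://p8-fruits/libs/spark-deep-learning-1.5.0-spark2.4-s_2.11.jar"
LOCAL_JARS_DIR = "/usr/lib/spark/jars/"  # assumed location of Spark's jars on EMR

subprocess.run(["aws", "s3", "cp", JAR_S3_PATH, LOCAL_JARS_DIR], check=True)

For option 2, I pictured something like the following %%configure cell, assuming the cluster can reach the spark-packages repository (the repository URL is my guess):

%%configure -f
{
    "conf": {
        "spark.jars.packages": "databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11",
        "spark.jars.repositories": "https://repos.spark-packages.org"
    }
}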
EDIT: I tried:
%%configure -f
{
    "conf": {
        "spark.jars": "s3://p8-fruits/libs/spark-deep-learning-1.5.0-spark2.4-s_2.11.jar"
    }
}
This did not throw any error, but I am still not able to use the package. When I try import sparkdl, I get ModuleNotFoundError: No module named 'sparkdl'.
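My suspicion (I may well be wrong) is that spark.jars only adds the jar to the JVM classpath, while import sparkdl needs the Python package bundled inside that jar to be on the driver's sys.path. Since a jar is just a zip archive, I was considering trying something like this from the notebook (assuming addPyFile accepts a jar and can fetch it from S3 on EMR):

# Hypothetical workaround: ship the jar as a py-file so the Python module
# packaged inside it (sparkdl) becomes importable via zipimport.
spark.sparkContext.addPyFile(
    "s3://p8-fruits/libs/spark-deep-learning-1.5.0-spark2.4-s_2.11.jar"
)

import sparkdl  # hoping the module is now on sys.path

Would something like this be the right direction, or is there a cleaner way on EMR?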
Thank you very much for your help!