
I created an EMR cluster on AWS with Spark and Livy. I submitted a custom JAR with some additional libraries (e.g. data sources for custom formats) as a custom JAR step. However, the classes from that JAR are not available when I try to access them from Livy.

What do I have to do to make the custom stuff available in the environment?

rabejens
  • Is it available in your Spark job? – Yuval Itzchakov Jun 19 '19 at 11:27
  • In my Spark job, I add it as a dependency and `sbt assembly` packs it into the fat JAR. I want to include a library my colleagues can use when they use Spark with Livy. – rabejens Jun 19 '19 at 11:30
  • You need to make sure you add it to their jobs via the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties passed to `spark-submit`. – Yuval Itzchakov Jun 19 '19 at 11:33
  • Ah, so I should handle this via the classifications JSON I can supply when creating the cluster? – rabejens Jun 19 '19 at 11:34
  • Yes, I believe so. I don't remember the specific details but you can definitely provide custom configuration. – Yuval Itzchakov Jun 19 '19 at 11:56
  • I am currently trying to use bootstrap actions to copy my library to the nodes in conjunction with configuration classifications. Let's see if that works. – rabejens Jun 19 '19 at 11:58
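
For reference, the approach suggested in the comments would look roughly like the following on the command line, assuming the library has already been copied to a local path on the nodes; the paths and class name here are placeholders, not from the original setup:

    # Hypothetical example: make a locally available JAR visible to driver and executors
    spark-submit \
      --conf spark.driver.extraClassPath=/home/hadoop/mylib/mylib.jar \
      --conf spark.executor.extraClassPath=/home/hadoop/mylib/mylib.jar \
      --class com.example.MyJob /home/hadoop/myjob.jar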

2 Answers


I am posting this as an answer to be able to accept it - I figured it out thanks to Yuval Itzchakov's comments and the AWS documentation on Custom Bootstrap Actions.

So here is what I did:

  1. I put my library JAR (a fat JAR created with sbt assembly, containing everything needed) into an S3 bucket.
  2. Created a script named copylib.sh with the following content:

    #!/bin/bash
    
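    # Fetch the library JAR from S3 onto every node at a fixed local path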
    mkdir -p /home/hadoop/mylib
    aws s3 cp s3://mybucket/mylib.jar /home/hadoop/mylib
    
  3. Created the following configuration JSON and put it into the same bucket, alongside mylib.jar and copylib.sh:

    [{
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [{
            "Classification": "export",
            "Properties": {
                "PYSPARK_PYTHON": "/usr/bin/python3"
            }
        }]
    }, {
        "Classification": "yarn-env",
        "Properties": {},
        "Configurations": [{
            "Classification": "export",
            "Properties": {
                "PYSPARK_PYTHON": "/usr/bin/python3"
            }
        }]
    }, {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar",
            "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar"
        }
    }]
    

    The spark-env and yarn-env classifications are needed for PySpark to work with Python 3 on EMR through Livy. There is another catch: EMR already populates the two extraClassPath properties with a number of libraries it needs to function properly, so I had to run a cluster without my library first, extract these settings from spark-defaults.conf, and append my JAR to them in the classification above (a way to read them off a running cluster is sketched further down). Otherwise, things like S3 access would not work.

  4. When creating the cluster, in Step 1 I referenced the configuration JSON file from above in Edit software settings, and in Step 3, I configured copylib.sh as a Custom Bootstrap Action.
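
For anyone scripting cluster creation instead of using the console, the same two settings can presumably also be passed to the AWS CLI; the following is only a sketch (release label, instance settings, and bucket paths are placeholders, not taken from the setup above):

    aws emr create-cluster \
        --name "spark-livy-with-mylib" \
        --release-label emr-5.24.0 \
        --applications Name=Spark Name=Livy Name=JupyterHub \
        --instance-type m5.xlarge --instance-count 3 \
        --use-default-roles \
        --configurations https://s3.amazonaws.com/mybucket/myconfig.json \
        --bootstrap-actions Path=s3://mybucket/copylib.sh,Name=copylib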

I can now open the JupyterHub of the cluster, start a notebook, and work with my added functions.
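
A side note on step 3: the defaults that EMR generates can be read directly off a running cluster (assuming SSH access to the master node), for example:

    # Print the classpath defaults that EMR generates, to be extended with the custom JAR
    grep extraClassPath /etc/spark/conf/spark-defaults.conf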

rabejens

I use an alternative approach that does not need a bootstrap action.

  1. Place the JARs in S3
  2. Pass them via the --jars option of spark-submit, e.g. spark-submit --jars s3://my-bucket/extra-jars/*.jar. All the JARs will be copied to the cluster.

This way, we can use any JAR from S3 even if we forgot to add a bootstrap action during cluster creation.
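
Since the question was originally about Livy, it is perhaps also worth noting that a Livy session can pull in extra JARs when it is created; a rough sketch (the endpoint and bucket are placeholders) could look like:

    # Hypothetical example: create a Livy session that loads an extra JAR from S3
    curl -X POST -H "Content-Type: application/json" \
        -d '{"kind": "spark", "jars": ["s3://my-bucket/extra-jars/mylib.jar"]}' \
        http://<emr-master-dns>:8998/sessions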

arunvelsriram