
I created an EMR cluster on AWS with Spark and Livy. I submitted a custom JAR with some additional libraries (e.g. data sources for custom formats) as a custom JAR step. However, the classes from that JAR are not available when I try to access them from Livy.

What do I have to do to make the custom stuff available in the environment?

rabejens
  • Is it available in your Spark job? – Yuval Itzchakov Jun 19 '19 at 11:27
  • In my Spark job, I add it as a dependency and `sbt assembly` packs it into the fat JAR. I want to include a library my colleagues can use when they use Spark with Livy. – rabejens Jun 19 '19 at 11:30
  • You need to make sure you add it to their jobs via the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties passed to `spark-submit`. – Yuval Itzchakov Jun 19 '19 at 11:33
  • Ah, so I should handle this via the classifications JSON I can supply when creating the cluster? – rabejens Jun 19 '19 at 11:34
  • Yes, I believe so. I don't remember the specific details but you can definitely provide custom configuration. – Yuval Itzchakov Jun 19 '19 at 11:56
  • I am currently trying to use bootstrap actions to copy my library to the nodes in conjunction with configuration classifications. Let's see if that works. – rabejens Jun 19 '19 at 11:58
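
For reference, the approach suggested in the comments would look roughly like the following on the command line, assuming the library has already been copied to a local path on the nodes; the paths and class name here are placeholders, not from the original setup:

    # Hypothetical example: make a locally available JAR visible to driver and executors
    spark-submit \
      --conf spark.driver.extraClassPath=/home/hadoop/mylib/mylib.jar \
      --conf spark.executor.extraClassPath=/home/hadoop/mylib/mylib.jar \
      --class com.example.MyJob /home/hadoop/myjob.jar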

2 Answers


I am posting this as an answer to be able to accept it - I figured it out thanks to Yuval Itzchakov's comments and the AWS documentation on Custom Bootstrap Actions.

So here is what I did:

  1. I put my library JAR (a fat JAR created with sbt assembly, containing everything needed) into an S3 bucket.
  2. Created a script named copylib.sh with the following content:

    #!/bin/bash
    
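    # Fetch the library JAR from S3 onto every node at a fixed local path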
    mkdir -p /home/hadoop/mylib
    aws s3 cp s3://mybucket/mylib.jar /home/hadoop/mylib
    
  3. Created the following configuration JSON and put it into the same bucket, alongside mylib.jar and copylib.sh:

    [{
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [{
            "Classification": "export",
            "Properties": {
                "PYSPARK_PYTHON": "/usr/bin/python3"
            }
        }]
    }, {
        "Classification": "yarn-env",
        "Properties": {},
        "Configurations": [{
            "Classification": "export",
            "Properties": {
                "PYSPARK_PYTHON": "/usr/bin/python3"
            }
        }]
    }, {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar",
            "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar"
        }
    }]
    

    The spark-env and yarn-env classifications are needed for PySpark to work with Python 3 on EMR through Livy. There is another catch: EMR already populates the two extraClassPath properties with a number of libraries it needs to function properly, so I had to run a cluster without my library first, extract these settings from spark-defaults.conf, and append my JAR to them in the classification above (a way to read them off a running cluster is sketched further down). Otherwise, things like S3 access would not work.

  4. When creating the cluster, in Step 1 I referenced the configuration JSON file from above in Edit software settings, and in Step 3, I configured copylib.sh as a Custom Bootstrap Action.
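
For anyone scripting cluster creation instead of using the console, the same two settings can presumably also be passed to the AWS CLI; the following is only a sketch (release label, instance settings, and bucket paths are placeholders, not taken from the setup above):

    aws emr create-cluster \
        --name "spark-livy-with-mylib" \
        --release-label emr-5.24.0 \
        --applications Name=Spark Name=Livy Name=JupyterHub \
        --instance-type m5.xlarge --instance-count 3 \
        --use-default-roles \
        --configurations https://s3.amazonaws.com/mybucket/myconfig.json \
        --bootstrap-actions Path=s3://mybucket/copylib.sh,Name=copylib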

I can now open the JupyterHub of the cluster, start a notebook, and work with my added functions.
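
A side note on step 3: the defaults that EMR generates can be read directly off a running cluster (assuming SSH access to the master node), for example:

    # Print the classpath defaults that EMR generates, to be extended with the custom JAR
    grep extraClassPath /etc/spark/conf/spark-defaults.conf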

rabejens

I use an alternative approach that does not need a bootstrap action.

  1. Place the JARs in S3
  2. Pass them via the --jars option of spark-submit, e.g. spark-submit --jars s3://my-bucket/extra-jars/*.jar. All the JARs will be copied to the cluster.

This way, we can use any JAR from S3 even if we forgot to add a bootstrap action during cluster creation.
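
Since the question was originally about Livy, it is perhaps also worth noting that a Livy session can pull in extra JARs when it is created; a rough sketch (the endpoint and bucket are placeholders) could look like:

    # Hypothetical example: create a Livy session that loads an extra JAR from S3
    curl -X POST -H "Content-Type: application/json" \
        -d '{"kind": "spark", "jars": ["s3://my-bucket/extra-jars/mylib.jar"]}' \
        http://<emr-master-dns>:8998/sessions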

arunvelsriram