
I'm trying to connect to an AWS Redis (ElastiCache) cluster from an EMR cluster. I uploaded the jar driver to S3 and used this bootstrap action to copy the jar file to the cluster nodes:

    aws s3 cp s3://sparkbucket/spark-redis-2.3.0.jar /home/hadoop/spark-redis-2.3.0.jar

This is my connection test spark app:

import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder\
    .config("spark.redis.host", "testredis-0013.vb4vgr.00341.eu1.cache.amazonaws.com")\
    .config("spark.redis.port", "6379")\
    .appName("Redis_test").getOrCreate()

    df = spark.read.format("org.apache.spark.sql.redis").option("key.column", "key").option("keys.pattern","*").load()

    df.write.csv(path='s3://sparkbucket/',sep=',')
    
    spark.stop()

When running the application using this spark-submit command:

    spark-submit --deploy-mode cluster --driver-class-path /home/hadoop/spark-redis-2.3.0.jar s3://sparkbucket/testredis.py

I get the following error and I'm not sure what I did wrong:

    ERROR Client: Application diagnostics message: User application exited with status 1
    Exception in thread "main" org.apache.spark.SparkException: Application application_1658168513779_0001 finished with failed status
  • This is just a warning. Does the app actually crash (log line with level ERROR)? – bzu Jul 18 '22 at 19:01
  • this is the error from log :ERROR Client: Application diagnostics message: User application exited with status 1 Exception in thread "main" org.apache.spark.SparkException: Application application_1658168513779_0001 finished with failed status – billie class Jul 18 '22 at 19:04
  • Are you getting these logs from S3? There should be some more detailed error info somewhere. Maybe this helps: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html#emr-manage-view-web-log-files-s3 – bzu Jul 18 '22 at 19:11
  • in the spark UI it show nothing under stage and also in the log files this is the only error i could find! – billie class Jul 18 '22 at 19:26
  • The problem that you are facing is that once you add the `--driver-class-path` you overwrite the original class-path of the EMR spark. What you need to do is to get the `driver-class-path` and append to the end your new jar. – Thiago Baldim Jul 19 '22 at 05:55
  • is there a documentation to this manipulation ? – billie class Jul 19 '22 at 09:29
  • does this help? -- https://stackoverflow.com/questions/29099115/spark-submit-add-multiple-jars-in-classpath -- see [this AWS tutorial](https://aws.amazon.com/pt/premiumsupport/knowledge-center/emr-permanently-install-library/) also – samkart Jul 19 '22 at 14:01
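The append-instead-of-overwrite approach from the comments could be sketched like this. This is only an assumption-laden sketch: `/etc/spark/conf/spark-defaults.conf` is the usual config location on EMR, but the property name and default class path vary by EMR release, so verify both on your cluster first:

```shell
# Sketch (assumed EMR layout): read the default driver class path from
# spark-defaults.conf and append the spark-redis jar to it, rather than
# replacing the whole class path with --driver-class-path.
DEFAULT_CP=$(grep -m1 '^spark.driver.extraClassPath' /etc/spark/conf/spark-defaults.conf | awk '{print $2}')

spark-submit --deploy-mode cluster \
  --driver-class-path "${DEFAULT_CP}:/home/hadoop/spark-redis-2.3.0.jar" \
  s3://sparkbucket/testredis.py
```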

1 Answer


With similar test code, I ran it successfully by uploading the spark-redis jar to S3 and passing it to spark-submit with `--jars`, as follows:

    spark-submit --deploy-mode cluster --jars s3://<bucket/path>/spark-redis_2.12-3.1.0-SNAPSHOT-jar-with-dependencies.jar s3://<bucket/path>/redis_test.py
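A variant, if you'd rather not ship the jar yourself, is to let spark-submit resolve spark-redis from Maven Central with `--packages`. The coordinates below are an assumption; pick the artifact matching your Scala and Spark versions:

```shell
# Sketch: resolve spark-redis (and its transitive dependencies) from
# Maven Central at submit time. Coordinates are assumed -- check the
# spark-redis releases for the version matching your cluster.
spark-submit --deploy-mode cluster \
  --packages com.redislabs:spark-redis_2.12:3.1.0 \
  s3://<bucket/path>/redis_test.py
```

Note that with either `--jars` or `--packages`, the bootstrap action copying the jar to every node is no longer needed, since Spark distributes the jar itself.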

The detailed logs for the run can be viewed in the Spark history server, which can be reached from the EMR web console by following this sequence of links:

Summary -> Spark history server -> application_xxx_xxx -> Executors -> (driver)stdout

You'll get a NoSuchKey error at first, since it takes some time for the logs to become available; just reload the page.

yjsa