
Maybe someone knows an easier way to do this. I am running an EMR cluster (6.x, 1 master and 1 core node) with Spark (3.x) on it, and I am trying to write some data to a MySQL RDS instance with a Spark job.
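The write itself is a plain DataFrame JDBC write, roughly along these lines (a minimal sketch; the endpoint, database, table, and credentials are placeholders, not the real values):

# df is the DataFrame to be written; every connection detail below is a placeholder
df.write \
    .format("jdbc") \
    .option("url", "jdbc:mysql://<rds-endpoint>:3306/<database>") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "<table>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .mode("append") \
    .save()

I submit the job like this: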

spark-submit --jars s3://s3-bucket-jar-assignment/mysql-connector-java-8.0.25.jar s3://s3-bucket-scripts-assignment/scripts/pyspark_script.py

This gives me the error below. I should mention that I have not installed the jar on the master node; how do I do that when the jar is in an S3 bucket?

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  File "/mnt/tmp/spark-079b5158-31f7-419b-9877-0e557b9aa612/pyspark_script.py", line 11
    .config(conf=SparkConf()).getOrCreate()
    ^
IndentationError: unexpected indent
21/12/18 20:58:16 INFO ShutdownHookManager: Shutdown hook called
21/12/18 20:58:16 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-079b5158-31f7-419b-9877-0e557b9aa612
QBits

1 Answer


Seems like the issue is not related to the package/jar; it's indentation. Since I haven't seen your full Spark initialisation code I can only guess, but it looks like you break a line without adding the \ sign, which tells the Python interpreter that the line breaks and the code continues. Another option is to put everything within parentheses: ( your code )

So, for example, this raises the IndentationError:

spark = SparkSession.builder
    .config(....)

Instead, write it on one line:

spark = SparkSession.builder.config(...)

or, if you want to break the line, wrap it in parentheses:

spark = (
    SparkSession.builder
    .config(....)
)
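And if you prefer the \ continuation mentioned above, that would look like this (a minimal sketch using the builder call from your traceback):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Trailing backslashes tell the interpreter the statement continues on the next line
spark = SparkSession.builder \
    .config(conf=SparkConf()) \
    .getOrCreate()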
Benny Elgazar
  • Yes, thanks, that was the problem (so far); I will mark this as answered. But I still get a "connection refused" error, so I have opened a separate question here: https://stackoverflow.com/questions/70411263/connection-error-while-writing-dataframe-pyspark-3-x-on-emr-6-x-to-rds-mysql. Maybe you can take a look. – QBits Dec 19 '21 at 12:10