I was facing an error: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
and stumbled upon the solution here, which works.
However, in a note right after the answer, the author states the following:
com.amazonaws:aws-java-sdk-pom:1.11.760 : depends on jdk version
hadoop:hadoop-aws:2.7.0 : depends on your hadoop version
s3.us-west-2.amazonaws.com : depends on your s3 location
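As I understand it, one way to check which AWS SDK release a given hadoop-aws version was built against is to inspect its POM on Maven Central (the URL follows the standard repository layout; 2.8.5 is my Hadoop version, the grep pattern is just my guess at the dependency's artifactId, and the version may be pinned in a parent POM, so this is only a starting point):
curl -s https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.5/hadoop-aws-2.8.5.pom | grep -B1 -A1 aws-java-sdk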
So, based on that note, when I run the following command:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.8.0_242,org.apache.hadoop:hadoop-aws:2.8.5
I face the following error:
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.amazonaws#aws-java-sdk-pom;1.8.0_242: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/pyspark/shell.py", line 38, in <module>
SparkContext._ensure_initialized()
File "/opt/app-root/lib/python3.6/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/opt/app-root/lib/python3.6/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/opt/app-root/lib/python3.6/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I changed the versions in the command to match my environment, as listed below (a corrected attempt based on these versions follows the list):
- JDK version:
(app-root) java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
- PySpark version:
2.4.5
- Hadoop version:
2.8.5
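From the error, it looks like Maven/Ivy is rejecting 1.8.0_242 as an artifact version, since aws-java-sdk-pom releases are numbered like 1.11.x rather than by JDK version strings. If that reading is right, something along these lines may be what the note intends; 1.11.760 is copied from the note itself, and I have not verified that it is the right SDK release for hadoop-aws 2.8.5:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.8.5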
How can I resolve this error and start a pyspark shell with the correct dependencies so that I can read files from S3?
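For reference, once the shell starts, the read I am ultimately trying to do looks roughly like this (the bucket and key here are placeholders, not my real ones):
df = spark.read.text("s3a://some-bucket/some-prefix/some-file.txt")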