
I am trying to access an s3:// path with

spark.read.parquet("s3://<path>")

And I get this error

Py4JJavaError: An error occurred while calling o31.parquet. : java.io.IOException: No FileSystem for scheme: s3

However, running the following command

hadoop fs -ls <path>

does work.

So I guess this might be a configuration issue between Hadoop and Spark.

How can this be solved?

EDIT

After reading the suggested answer, I tried adding the jars hard-coded to the Spark config:

spark = SparkSession\
.builder.master("spark://" + master + ":7077")\
.appName("myname")\
.config("spark.jars", "/usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.221.jar,/usr/share/aws/aws-java-sdk/hadoop-aws.jar")\
.config("spark.jars.packages", "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2")\
.getOrCreate()

No success

Uri Goren
    Looks like "s3://" is deprecated, can you please try "s3a://" or "s3n://" – Anurag Sharma Jan 17 '18 at 17:24
  • same behavior for `s3n` and `s3a` – Uri Goren Jan 17 '18 at 19:36
  • Is this Spark on EMR or your custom installation on ec2/emr? Spark on EMR should have no problem accessing the s3:// prefix by default unless you messed up classpaths or deleted jars etc. In fact it will invoke the EMRFS file system for the s3:// or s3n:// prefix. – jc mannem Jan 22 '18 at 20:14
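
Following the comment's suggestion to try the s3a:// scheme, here is a minimal sketch of wiring the S3A filesystem up explicitly; it assumes the hadoop-aws and AWS SDK jars are already on the classpath (as they are on EMR), and the bucket/path is a placeholder:

from pyspark.sql import SparkSession

# Minimal sketch: map the s3a:// scheme to Hadoop's S3AFileSystem and read
# the data through s3a:// instead of s3://.
spark = SparkSession.builder\
    .appName("s3a-read-sketch")\
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")\
    .getOrCreate()

df = spark.read.parquet("s3a://my-bucket/path/to/data")  # placeholder path
df.show()

If the jars are not on the classpath, this will still fail (with a ClassNotFoundException for S3AFileSystem instead), which is what the answer below addresses.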

1 Answer


The hadoop-aws dependency is missing from your project. Please add hadoop-aws to your build.

Ravikumar
  • It's a pyspark application, there's no build. How can I add this dependency when using an interactive shell ? – Uri Goren Jan 17 '18 at 19:37
  • Please use it as follows: `pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2`. The other way is to add the "spark.jars.packages" property in the spark-defaults.conf file: `spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2` – Ravikumar Jan 17 '18 at 21:38
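
Applying that comment to the session from the question, a sketch that relies only on spark.jars.packages (dropping the hard-coded spark.jars entry, so two different AWS SDK versions don't end up on the classpath). Note that the packages setting only takes effect if it is configured before the SparkContext starts, which is why the --packages flag is the simpler route from an interactive shell. The master host and path are placeholders:

from pyspark.sql import SparkSession

master = "<master-host>"  # placeholder, as in the question

# Pull hadoop-aws and the AWS SDK it pairs with from Maven at startup;
# the versions follow the comment above (adjust to your cluster's Hadoop version).
spark = SparkSession\
    .builder.master("spark://" + master + ":7077")\
    .appName("myname")\
    .config("spark.jars.packages", "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2")\
    .getOrCreate()

df = spark.read.parquet("s3a://my-bucket/path/to/data")  # placeholder path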