
I am trying to read data from an S3 bucket in PySpark code, and I am using a Jupyter notebook. I have Spark set up on my machine and use it in Jupyter by importing findspark:

import findspark
findspark.init()

from pyspark.sql import *

spark = SparkSession.builder.appName("my_app").getOrCreate()

But when I try to read the data from the bucket, I get the error java.io.IOException: No FileSystem for scheme: s3.

input_bucket = "s3://bucket_name"
data = spark.read.csv(input_bucket + '/file_name', header=True, inferSchema=True)

I found some solutions on the internet that say to add these 2 packages (hadoop-aws and aws-java-sdk). I downloaded these jar files and added them to the jars folder of Spark, but I am still getting the same error.

I don't know whether this is a version compatibility issue or some other problem. If it is a compatibility issue, how does one decide which versions of the jar files to use for a given PySpark, Python, and Java version?

Versions

pyspark 2.4.8
python 3.7.9
java version "1.8.0_301"
Java(TM) SE Runtime Environment (build 1.8.0_301-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode)
javac 1.8.0_301

Jar Files

hadoop-aws-2.7.3.jar
aws-java-sdk-1.8.2.jar

PS: I am working on Windows.

Malina Dale

2 Answers


A lot more goes on under the hood to achieve this amalgamation between Java and Python within Spark.

Primarily, this is a version compatibility issue between the different jars. Ensuring consistency across the different components is the starting point for tackling issues like this.

Hadoop Version

Navigate to the location where Spark is installed; ensuring consistent versions across the hadoop-* jars is the first step.

[ vaebhav@localhost:/usr/local/Cellar/apache-spark/3.1.2/libexec/jars - 10:39 PM ]$ ls -lthr *hadoop-*
-rw-r--r--  1 vaebhav  root    79K May 24 10:15 hadoop-yarn-server-web-proxy-3.2.0.jar
-rw-r--r--  1 vaebhav  root   1.3M May 24 10:15 hadoop-yarn-server-common-3.2.0.jar
-rw-r--r--  1 vaebhav  root   221K May 24 10:15 hadoop-yarn-registry-3.2.0.jar
-rw-r--r--  1 vaebhav  root   2.8M May 24 10:15 hadoop-yarn-common-3.2.0.jar
-rw-r--r--  1 vaebhav  root   310K May 24 10:15 hadoop-yarn-client-3.2.0.jar
-rw-r--r--  1 vaebhav  root   3.1M May 24 10:15 hadoop-yarn-api-3.2.0.jar
-rw-r--r--  1 vaebhav  root    84K May 24 10:15 hadoop-mapreduce-client-jobclient-3.2.0.jar
-rw-r--r--  1 vaebhav  root   1.6M May 24 10:15 hadoop-mapreduce-client-core-3.2.0.jar
-rw-r--r--  1 vaebhav  root   787K May 24 10:15 hadoop-mapreduce-client-common-3.2.0.jar
-rw-r--r--  1 vaebhav  root   4.8M May 24 10:15 hadoop-hdfs-client-3.2.0.jar
-rw-r--r--  1 vaebhav  root   3.9M May 24 10:15 hadoop-common-3.2.0.jar
-rw-r--r--  1 vaebhav  root    43K May 24 10:15 hadoop-client-3.2.0.jar
-rw-r--r--  1 vaebhav  root   136K May 24 10:15 hadoop-auth-3.2.0.jar
-rw-r--r--  1 vaebhav  root    59K May 24 10:15 hadoop-annotations-3.2.0.jar
-rw-r--r--@ 1 vaebhav  root   469K Oct  9 00:30 hadoop-aws-3.2.0.jar
[ vaebhav@localhost:/usr/local/Cellar/apache-spark/3.1.2/libexec/jars - 10:39 PM ]$ 
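
If running a listing like the one above is inconvenient (for instance on Windows, as in the question), a rough cross-check of the Hadoop version Spark was built against can also be done from PySpark itself. This is only a sketch; it goes through the py4j gateway (sc._jvm), which is an internal API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version_check").getOrCreate()
sc = spark.sparkContext

# Hadoop version bundled with this Spark installation - the hadoop-aws jar you add should match it
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())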

For further third-party connectivity such as S3, you can check the corresponding compile dependency on MVN Repository by searching for the respective jar - in your case, hadoop-aws-2.7.3.jar.

MVN Compile Dependency

After searching for the respective artifact on MVN Repository, check which AWS SDK jar is listed under its compile dependencies.

(Screenshots: the hadoop-aws artifact page on MVN Repository, with the matching aws-java-sdk version shown under Compile Dependencies.)

These checkpoints can be your entry point for making sure the correct dependencies are in place.
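
As a sketch of what a consistent pairing looks like: hadoop-aws 2.7.3 lists aws-java-sdk 1.7.4 as its compile dependency on MVN Repository (adjust both versions to whatever matches the hadoop-* jars of your own Spark install). Rather than copying jars by hand, the pair can also be passed to Spark via spark.jars.packages when the session is created:

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my_app")
    # versions are illustrative - match hadoop-aws to your Spark's hadoop-* jars,
    # and aws-java-sdk to the compile dependency listed for that hadoop-aws version
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4")
    .getOrCreate()
)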

Once the dependencies are sorted out, there are a few additional steps for S3 connectivity.

PySpark S3 Example

Currently the AWS connector supports s3a and s3n; I have demonstrated how to establish s3a, and the latter is fairly easy to implement as well.

The difference between the implementations can be found in this brilliant answer.

from pyspark import SparkContext
from pyspark.sql import SQLContext
import configparser
import os

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)

hadoop_conf = sc._jsc.hadoopConfiguration()

# Read the AWS credentials for the given profile from the local credentials file
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))

access_key = config.get("<aws-account>", "aws_access_key_id")
secret_key = config.get("<aws-account>", "aws_secret_access_key")
session_key = config.get("<aws-account>", "aws_session_token")

# Point the s3a connector at the AWS endpoint over SSL
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "true")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Use temporary (session-token) credentials
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.session.token", session_key)

s3_path = "s3a://<s3-path>/"

sparkDF = sql.read.parquet(s3_path)
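
With that configuration in place, the CSV from the question should be readable the same way, just with the s3a scheme instead of s3 (bucket and file names are the placeholders from the question):

input_bucket = "s3a://bucket_name"
data = sql.read.csv(input_bucket + "/file_name", header=True, inferSchema=True)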
Vaebhav
  • this is a wonderful answer. But do cut the bits of the conf where fs.s3a.impl is set...that's just some folklore copied and pasted across Stack Overflow answers – stevel Oct 27 '21 at 17:52
  • Is there a better way to establish connectivity? I followed the procedure mentioned here - https://github.com/databricks/spark-redshift#authenticating-to-s3-and-redshift – Vaebhav Oct 27 '21 at 17:55
  • those db docs are specific to their product. The s3a connector ships with endpoint, connection and fs.s3a.impl set *out of the box* (in core-default.xml in hadoop-common), and the temporary credential provider is first in the list of cred providers (followed by: full creds, env vars, EC2 IAM secrets). Strip out all of the sc._jsc settings and all will work – stevel Oct 28 '21 at 11:29
  • @Vaebhav I am facing this same issue - java.io.IOException: No FileSystem for scheme: s3. I am trying to connect to Hive from Java Spark. Will this same approach work for Hive connectivity? .enableHiveSupport() – Prakash Raj Nov 16 '21 at 02:42
  • Yes the same approach is valid across hive as well – Vaebhav Mar 02 '22 at 08:12

Apache Hadoop cut its original s3:// connector in 2016 (HADOOP-12709). It was never compatible with the EMR s3:// URL filesystem anyway, which it predated by a number of years.

People should use s3a:// URLs with ASF Spark releases, together with an up-to-date and consistent set of hadoop-* JARs as well as the version of aws-sdk-bundle that the Hadoop release ships with. Try mixing things and all you will see are stack traces. If things don't work, the best starting place for troubleshooting is actually the document Troubleshooting S3A, whose key features are that (1) it was written by the people who wrote the code and can therefore be considered normative, and (2) it is maintained.

In contrast, most Stack Overflow answers are out of date within a few months of being posted, as they are rarely maintained, and they were not necessarily correct at the time of posting either.
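
As a minimal sketch of what a consistent setup looks like, assuming a Spark build whose jars directory contains hadoop-* 3.2.0 jars (as in the listing in the other answer; substitute your own Hadoop version): pulling hadoop-aws via spark.jars.packages also resolves the matching aws-java-sdk-bundle transitively, so the two stay in sync.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a_example")
    # hadoop-aws must match the hadoop-* jars of the Spark build;
    # its aws-java-sdk-bundle dependency is pulled in transitively
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    .getOrCreate()
)

# s3a:// is the supported scheme for ASF releases; bucket and file names are placeholders
df = spark.read.csv("s3a://bucket_name/file_name", header=True, inferSchema=True)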

stevel