30

I'm trying to read a txt file from S3 with Spark, but I'm getting this error:

No FileSystem for scheme: s3

This is my code:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("first")
sc = SparkContext(conf=conf)
data = sc.textFile("s3://"+AWS_ACCESS_KEY+":" + AWS_SECRET_KEY + "@/aaa/aaa/aaa.txt")

header = data.first()

This is the full traceback:

An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

How can I fix this?

Filipe Ferminiano
  • 1
    Did you try `s3n://`? Also, have a look at: https://stackoverflow.com/questions/42754276/naive-install-of-pyspark-to-also-support-s3-access – andrew_reece Oct 14 '17 at 07:26
  • 1
    I tried, but I still get the same error. I took a look at the question; it's the same thing I'm already using. I'm using pyspark 2.2 – Filipe Ferminiano Oct 14 '17 at 14:40
  • 1
    In Scala for access to S3 additional "hadoop-aws" libraries are required. Maybe, you need them too: https://gist.github.com/eddies/f37d696567f15b33029277ee9084c4a0 – pasha701 Oct 14 '17 at 19:59
  • I would suggest using some POSIX-compatible filesystem like http://juicefs.io that support s3 as a backend. You just need to mount the filesystem and then use it like a local directory, your code looks the same either in a local environment or some instance on cloud. – satoru Nov 17 '18 at 00:20

3 Answers

14

If you are using a local machine you can use boto3:

import boto3

s3 = boto3.resource('s3')
# get a handle on the bucket that holds your file
bucket = s3.Bucket('yourBucket')
# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key='yourFile.extension')
# get the object
response = obj.get()
# read the contents of the file and split it into a list of lines
# (the body is returned as bytes, so decode it before splitting)
lines = response['Body'].read().decode('utf-8').splitlines()

(do not forget to setup your AWS S3 credentials).

Another clean solution if you are using an AWS Virtual Machine (EC2) would be granting S3 permissions to your EC2 and launching pyspark with this command:

pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2

If you add other packages, make sure the format is: 'groupId:artifactId:version' and the packages are separated by commas.
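
As a quick sanity check, the coordinate format can be verified with a few lines of Python before launching; the package list below is just the one from the command above:

```python
import re

# each entry must be groupId:artifactId:version, entries separated by commas
packages = "com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2"

coordinate = re.compile(r"^[\w.\-]+:[\w.\-]+:[\w.\-]+$")
for pkg in packages.split(","):
    if not coordinate.match(pkg):
        raise ValueError("bad Maven coordinate: " + pkg)
print("all coordinates look valid")
```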

If you are using pyspark from Jupyter Notebooks this will work:

import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath) # Parquet file read example
HagaiA
gilgorio
  • 2
    so boto3 doesn't work in cluster spark for some reason or what's the problem? I think I can wrap split boto command into DF, right? Otherwise I have same issue with spark 2.4.2 at moment – kensai Apr 25 '19 at 10:56
  • 1
    Just note this doesn't necessarily work, the version you load within pyspark script must be same as version in jars folder of your pyspark installation. My pyspark has in jars hadoo* files of 2.7.3, so I use it for submit args as well and now I can go simply with S3 (no a/n required) – kensai Apr 25 '19 at 11:50
  • 1
    Sorry, but why do you have spaces? Isn't the correct command `pyspark --packages=com.amazonaws:aws-java-sdk:1.11.7755,org.apache.hadoop:hadoop-aws:3.2.1` ? Note that this still fails for me with s3 and s3a urls. – rjurney May 06 '20 at 00:41
  • It worked for me with the space between the arg name and arg value – gilgorio May 08 '20 at 18:35
  • 7
    Hi when I tried `pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 ` it gave me error `[NOT FOUND ] junit#junit;4.11!junit.jar`, and the `No FileSystem for scheme: s3` error is still there, I'm using pyspark 2.4.4, any chance you know why? – wawawa Feb 05 '21 at 09:36
10

The above answers are correct regarding the need to specify the Hadoop <-> AWS dependencies.

They do not cover the newer versions of Spark, though, so I will post what worked for me, especially since the setup changed as of Spark 3.2.x, when Spark upgraded to Hadoop 3.

Spark 3.0.3

  • --packages: org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2
  • --exclude-packages: com.google.guava:guava

Spark 3.2+

The module is now released together with Spark, so use the same version as your Spark:

  • --packages: org.apache.spark:spark-hadoop-cloud_2.12:3.2.0
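
For example, assuming Spark 3.2.0 built with Scala 2.12 (adjust both versions to match your installation), the shell can be launched like this:

```shell
pyspark --packages org.apache.spark:spark-hadoop-cloud_2.12:3.2.0
```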

According to Spark documentation

The documentation says you should use the org.apache.spark:hadoop-cloud_2.12:<SPARK_VERSION> library. The problem is that this artifact does not exist in the Maven Central repository.

How to set extra dependencies?

in spark-submit

Use --packages and --exclude-packages parameters
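
Using the Spark 3.0.3 coordinates listed above, a spark-submit invocation would look like this (`job.py` is a placeholder for your own script):

```shell
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2 \
  --exclude-packages com.google.guava:guava \
  job.py
```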

in SparkSession.builder

Use spark.jars.packages and spark.jars.excludes spark configs

spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.10.2,org.apache.hadoop:hadoop-client:2.10.2")
    .config("spark.jars.excludes", "com.google.guava:guava")
    .getOrCreate()
)

s3 vs s3a

The above adds the S3AFileSystem to Spark's classpath. If you additionally set this Spark config, not only s3a://... but also s3://... paths will work:

  • spark.hadoop.fs.s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem

Set it using SparkSession.builder.config() or via --conf when using spark-submit.
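
For example, on the command line (`job.py` is a placeholder for your own script; the same key works in SparkSession.builder.config() or spark-defaults.conf):

```shell
spark-submit \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  job.py
```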

botchniaque
  • thank you! if you're using delta tables, make spark context with extra packages like this: `from delta import configure_spark_with_delta_pip; builder = pyspark.sql.SparkSession.builder.appName("MyApp"); spark = configure_spark_with_delta_pip(builder, extra_packages=["org.apache.spark:spark-hadoop-cloud_2.12:3.3.0", ]).getOrCreate()` – 123 Oct 13 '22 at 18:26
8

If you're using a Jupyter notebook, you must add two files to the classpath for Spark:

/home/ec2-user/anaconda3/envs/ENV-XXX/lib/python3.6/site-packages/pyspark/jars

The two files are:

  • hadoop-aws-2.10.1-amzn-0.jar
  • aws-java-sdk-1.11.890.jar
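
A minimal sketch of that setup, assuming the two jars have already been downloaded to the current directory and that pyspark lives in the conda environment path shown above:

```shell
# copy the AWS jars into pyspark's jars directory so they land on the classpath
cp hadoop-aws-2.10.1-amzn-0.jar aws-java-sdk-1.11.890.jar \
   /home/ec2-user/anaconda3/envs/ENV-XXX/lib/python3.6/site-packages/pyspark/jars/
```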
welkinwalker