6

I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3 installed via pip in a virtualenv. There is no Hadoop installation on the local host, and no full Spark installation either (thus no SPARK_HOME, HADOOP_HOME, etc.).

When I try this:

from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

I get:

py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3

How can I read from s3 while running pyspark in local mode without a complete Hadoop install locally?

FWIW - this works great when I execute it on an EMR node in non-local mode.

The following does not work (same error, although it does resolve and download the dependencies):

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

Same (bad) results with:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/hadoop-aws-3.1.0.jar" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")
Jared
  • Possible duplicate of [How can I access S3/S3n from a local Hadoop 2.6 installation?](https://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation) – Alper t. Turker May 04 '18 at 22:53
  • There is no local Hadoop installation in this case - just Spark installed in the virtualenv via pip. – Jared May 06 '18 at 18:02
  • Try to use `s3a` [protocol](https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html): inputFile = sparkContext.textFile("s3a://somebucket/file.csv") – prudenko May 08 '18 at 10:44

3 Answers

10

So Glennie's answer was close, but not what will work in your case. The key thing is to select the right version of the dependencies. If you look at the jars bundled with pyspark inside the virtual environment:

[screenshot of the pyspark jars directory, showing hadoop-common-2.7.3 and the other Hadoop jars]

Everything points to one version, 2.7.3, which is what you also need to use:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
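
Note that PYSPARK_SUBMIT_ARGS is only read when pyspark launches the JVM, so it has to be set before the SparkContext is created. A minimal sketch of the ordering (master and app name taken from the question):

import os

# Must be set before the SparkContext is created, because the --packages are
# resolved when pyspark launches the JVM through its embedded spark-submit.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("pyspark-unittests")
sc = SparkContext(conf=conf)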

You should verify the version that your installation is using by checking the path venv/Lib/site-packages/pyspark/jars inside your project's virtual env.
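
For example, a quick way to print the bundled Hadoop version from Python (the venv path below is an assumption; adjust it to your project layout):

import glob

# List the hadoop-* jars that ship with the pip-installed pyspark; the version
# suffix tells you which hadoop-aws version to request via --packages.
for jar in sorted(glob.glob("venv/Lib/site-packages/pyspark/jars/hadoop-*.jar")):
    print(jar)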

After that you can use s3a by default, or s3 by defining the handler class for it:

# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")

print(s3File.count())
print(s3File.id())

And the output is below:

[screenshot of the Spark output]

Tarun Lalwani
  • What's the hierarchy in the Project pane to get to that "hadoop-common-2.7.3"? (So that I can check and make sure I have the same one). – Jared May 08 '18 at 21:18
  • On my laptop the relative path in environment is `venv/Lib/site-packages/pyspark/jars` – Tarun Lalwani May 08 '18 at 21:19
  • Very close. I'm now getting "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain." We don't support permanent awsKey/awsSecret pairs, so I need to figure out how to get this to work with either a sessionToken or com.amazonaws.auth.profile.ProfileCredentialsProvider, where I'll have a session token in my creds file. Any hints? Or want me to open another question on that one? – Jared May 08 '18 at 21:44
  • I think that is a new problem, so it is best sorted out in a new question. Post the link here and I will try to help – Tarun Lalwani May 08 '18 at 21:45
  • Here's the new question - thanks for any help you can give! https://stackoverflow.com/questions/50242843/how-do-i-use-an-aws-sessiontoken-to-read-from-s3-in-pyspark – Jared May 08 '18 at 22:00
  • Here's another one I would love your help with :): https://stackoverflow.com/questions/50243130/how-do-i-update-the-java-keystore-used-by-pyspark-running-in-pycharm-on-mac – Jared May 08 '18 at 22:31
  • should be `"fs.s3a.impl"` instead of `"fs.s3.impl"` – Cyzanfar Sep 17 '19 at 03:52
3

You should use the s3a protocol when accessing S3 locally. Make sure you add your key and secret to the SparkContext first. Like this:

sc = SparkContext(conf = conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

inputFile = sc.textFile("s3a://somebucket/file.csv")
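
As the comments below show, this alone still fails with class-not-found errors unless a hadoop-aws build matching the Hadoop jars bundled with the pip-installed pyspark (2.7.3 in the answer above) is on the classpath. A combined sketch, with placeholder credentials and bucket:

import os

# hadoop-aws must match the Hadoop version bundled with pyspark (assumed 2.7.3 here)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("pyspark-unittests")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

inputFile = sc.textFile("s3a://somebucket/file.csv")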
Glennie Helles Sindholt
  • With the --packages in the question, this gets closer..... I now get "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities". Seems like I'm still missing some dependencies. – Jared May 08 '18 at 21:04
  • If I also add "org.apache.hadoop:hadoop-common:3.1.0" to the packages, I get "java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()"....so probably just need to find the correct version of hadoop-common (I think) – Jared May 08 '18 at 21:13
1

preparation:

Add the following lines to your Spark config file; for my local pyspark it is /usr/local/spark/conf/spark-defaults.conf:

spark.hadoop.fs.s3a.access.key=<your access key>
spark.hadoop.fs.s3a.secret.key=<your secret key>
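
Equivalently, if you would rather not edit the config file, the same credentials can be set programmatically, since Spark copies any property prefixed with spark.hadoop. into the Hadoop configuration (the keys below are placeholders):

from pyspark import SparkConf, SparkContext

# "spark.hadoop."-prefixed properties are forwarded to the Hadoop configuration,
# so these are equivalent to the fs.s3a.* lines in spark-defaults.conf above.
conf = SparkConf() \
    .setAppName("read_s3") \
    .setMaster("local[2]") \
    .set("spark.hadoop.fs.s3a.access.key", "<your access key>") \
    .set("spark.hadoop.fs.s3a.secret.key", "<your secret key>")
sc = SparkContext(conf=conf)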

python file content:

from __future__ import print_function
import os

from pyspark import SparkConf
from pyspark import SparkContext

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"


if __name__ == "__main__":

    conf = SparkConf().setAppName("read_s3").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    my_s3_file3 = sc.textFile("s3a://store-test-1/test-file")
    print("file count:", my_s3_file3.count())

submit:

spark-submit --master local \
--packages org.apache.hadoop:hadoop-aws:2.7.3,\
com.amazonaws:aws-java-sdk:1.7.4,\
org.apache.hadoop:hadoop-common:2.7.3 \
<path to the py file above>
buxizhizhoum