The requirement is to load CSV and Parquet files from S3 into a DataFrame using PySpark.
The code I'm using is:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

appName = "S3"
master = "local"

conf = SparkConf().setAppName(appName).setMaster(master)
# Enable V4 request signing on both the driver and the executors
conf.set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
conf.set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')

sc = SparkContext.getOrCreate(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')

# Configure the S3A connector; the credentials are defined elsewhere
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', aws_access_key_id)
hadoopConf.set('fs.s3a.secret.key', aws_secret_access_key)
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

spark = SparkSession(sc)
df = spark.read.csv('s3://s3path/File.csv')
It gives me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
A similar error occurs while reading Parquet files:
py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
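The Parquet read itself is not shown above; as a minimal sketch, assuming it mirrors the CSV read (the path is a placeholder):

df = spark.read.parquet('s3://s3path/File.parquet')  # placeholder path, same bucket scheme as the CSV read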
How can I resolve this?