The requirement is to load CSV and Parquet files from S3 into a DataFrame using PySpark.
The code I'm using is:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

appName = "S3"
master = "local"

conf = SparkConf().setAppName(appName).setMaster(master)
# Enable V4 request signing on both the driver and the executors
conf.set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
conf.set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')

sc = SparkContext.getOrCreate(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')

# Configure the S3A connector; the credentials are defined elsewhere
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', aws_access_key_id)
hadoopConf.set('fs.s3a.secret.key', aws_secret_access_key)
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

spark = SparkSession(sc)
df = spark.read.csv('s3://s3path/File.csv')
It gives me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
A similar error occurs while reading Parquet files:
py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
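The Parquet read itself is not shown above; as a minimal sketch, assuming it mirrors the CSV read (the path is a placeholder):

df = spark.read.parquet('s3://s3path/File.parquet')  # placeholder path, same bucket scheme as the CSV read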
How can I resolve this?