I am attempting to load data from an AWS S3 bucket with Spark, but I keep getting this error:
Py4JJavaError: An error occurred while calling o152.csv. : com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: M4Z1B0MTQAY2GDCD, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: GS9ftm1p/TpmZNS4KtsAVmmRfQOIVnIg/22rhnI4i5HKF40pT/QGBAXTwrVNWsHCUQFhEOXD3Gk= at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
I saved my AWS Access Key ID and AWS Secret Access Key in a credentials.cfg file. I have a file called payment.csv in a bucket called datalakesexamp1 in S3. My code can be seen below:
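For reference, the credentials file is in the ini format that configparser expects; the key values below are placeholders, not my real keys:

```ini
[AWS]
AWS_ACCESS_KEY_ID = AKIA...
AWS_SECRET_ACCESS_KEY = ...
```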
from pyspark.sql import SparkSession
import configparser
import os

# Read the AWS keys from the credentials file and export them as
# environment variables for the S3 connector to pick up.
config = configparser.ConfigParser()
config.read('aws/credentials.cfg')
os.environ["AWS_ACCESS_KEY_ID"] = config['AWS']['AWS_ACCESS_KEY_ID']
os.environ["AWS_SECRET_ACCESS_KEY"] = config['AWS']['AWS_SECRET_ACCESS_KEY']

# Pull in the hadoop-aws package that provides the s3a:// filesystem.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .getOrCreate()

df = spark.read.csv("s3a://datalakesexamp1/payment.csv")
I believe the problem could be the 2.7.0 version of hadoop-aws, but I can't figure out where to find the correct one. I have tried other versions and get similar errors.
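One thing worth trying: exporting the keys as environment variables is not always enough for older hadoop-aws builds, so a common workaround is to set the fs.s3a.* options on Spark's Hadoop configuration directly. A minimal sketch, assuming the [AWS] section layout used in the code above; the s3a_options helper name is mine, not from any library:

```python
import configparser

def s3a_options(cfg_path: str) -> dict:
    """Read AWS keys from an ini-style credentials file and return
    the fs.s3a options that Spark's Hadoop layer expects."""
    config = configparser.ConfigParser()
    config.read(cfg_path)
    return {
        "fs.s3a.access.key": config["AWS"]["AWS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": config["AWS"]["AWS_SECRET_ACCESS_KEY"],
    }

# Applying the options requires an active SparkSession:
# hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# for key, value in s3a_options("aws/credentials.cfg").items():
#     hadoop_conf.set(key, value)
```

Separately, the hadoop-aws version generally has to match the Hadoop version your Spark build was compiled against (you can check the hadoop-common jar in pyspark's jars directory), since a mismatch can surface as authentication-style failures.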