I'm trying to run the following code on a PySpark kernel from EMR on EKS (using a managed endpoint). I tried to set some s3a-related Spark config, but it doesn't seem to be working.
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("S3 Read Example") \
    .getOrCreate()

spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark.conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")

# Read data from S3 using the s3a path
s3_path = "s3a://bucket/file.parquet"
df = spark.read \
    .format("parquet") \
    .load(s3_path)

spark.stop()
And I'm getting the following error. Can someone help identify the issue?
Py4JJavaError: An error occurred while calling o119.load. : org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://com.bucket.name/file-path/file.snappy.parquet: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <com.bucket.name.s3.amazonaws.com> doesn't match any of the subject alternative names: [.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <com.bucket.name.s3.amazonaws.com> doesn't match any of the subject alternative names: [.s3.amazonaws.com, s3.amazonaws.com]
Looking at the error, the client is using virtual-hosted-style addressing, and because the bucket name contains dots, the resulting hostname <com.bucket.name.s3.amazonaws.com> can't match the *.s3.amazonaws.com wildcard certificate. That's why I tried forcing path-style access. I applied the following as spark-defaults:

spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark.conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
spark.conf.set("fs.s3a.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
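One thing I'm unsure about: spark.conf.set runs after the session already exists, so the S3A filesystem may have been initialized before these properties take effect. Would moving them onto the builder change anything? A sketch of what I mean (same property names, just set before getOrCreate):

from pyspark.sql import SparkSession

# Same s3a options, but set on the builder so they are in place
# before the S3A filesystem is first created.
spark = SparkSession.builder \
    .appName("S3 Read Example") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .getOrCreate()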
I'm not sure if it's a Spark config issue. How can we read an s3a path on Spark 3 with EMR on EKS?
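For completeness, I also wondered whether writing the option directly into the live Hadoop configuration would behave differently. A sketch of that experiment (it goes through the private _jsc attribute, so I'm not sure it's the intended route on a managed endpoint):

# Set path-style access directly on the Hadoop configuration of the
# running session (private API, just an experiment).
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.path.style.access", "true")

df = spark.read.parquet("s3a://bucket/file.parquet")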