
I'm trying to run the following code on a PySpark kernel from EMR on EKS (using a managed endpoint). I tried to set some s3a-related Spark config, but it doesn't seem to work.

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("S3 Read Example") \
    .getOrCreate()

spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark.conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")

# Read data from S3 using the s3a path
s3_path = "s3a://bucket/file.parquet"

df = spark.read \
    .format("parquet") \
    .load(s3_path)

spark.stop()

And I get the following error. Can someone help identify the issue?


Py4JJavaError: An error occurred while calling o119.load. : org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://com.bucket.name/file-path/file.snappy.parquet: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <com.bucket.name.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <com.bucket.name.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

I tried to apply the following spark-defaults:

spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark.conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
spark.conf.set("fs.s3a.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")

I'm not sure if it's a Spark config issue.

How can we read an s3a:// path on Spark 3 with EMR on EKS?

1 Answer


Dotted bucket names aren't supported; AWS says they should only be used for websites, not as a store of data you work on in your applications.

If you must try to use them, set fs.s3a.path.style.access to true. However, before that, try using EMR's own s3:// connector, which is the one they officially support.
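
A minimal sketch of the path-style option, assuming web-identity (IRSA) credentials on EMR on EKS and a placeholder bucket/key. Note that fs.s3a.* options are Hadoop settings: they need to be in place before the first S3A filesystem instance is created, so set them on the builder (or in spark-defaults) rather than via spark.conf.set() on a live session.

from pyspark.sql import SparkSession

# Sketch: set path-style access at build time, before any S3A filesystem
# is created. The bucket and key below are placeholders.
spark = (
    SparkSession.builder
    .appName("S3 Read Example")
    # Path-style requests keep the bucket name in the URL path, so a
    # dotted bucket name never ends up in the TLS hostname.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
    .getOrCreate()
)

df = spark.read.parquet("s3a://bucket/file.parquet")

With the EMR connector instead, the same read is simply spark.read.parquet("s3://bucket/file.parquet"), with no fs.s3a.* settings needed.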

stevel
  • I tried to set the same config but it doesn't seem to work: spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true") – Trupal Patel Aug 04 '23 at 16:29
  • It works with s3:// and s3n:// paths, but these tables were already created with s3a:// paths. We have many tables that use s3a:// paths, and they work with EMR Spark on EC2 clusters; only with EMR on EKS are we having this issue. We would have to migrate all the tables if we switched to s3:// or s3n:// – Trupal Patel Aug 04 '23 at 16:31
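
A sketch of the remap the question already attempted (pointing fs.s3a.impl at EMRFS), which would let the existing s3a:// table locations keep working without a migration. It assumes com.amazon.ws.emr.hadoop.fs.EmrFileSystem is on the classpath, which it evidently is since s3:// reads work; as above, it has to be applied before the session is created to take effect.

from pyspark.sql import SparkSession

# Sketch: route the s3a:// scheme through EMRFS so existing s3a:// table
# paths resolve through the connector that already works here. Setting
# this on a live session won't change filesystem resolution.
spark = (
    SparkSession.builder
    .appName("s3a via EMRFS")
    .config("spark.hadoop.fs.s3a.impl",
            "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
    .getOrCreate()
)

df = spark.read.parquet("s3a://bucket/file.parquet")  # placeholder path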