
This is a follow-up to a question I asked earlier, and this one is more specific about how to access two S3 accounts from the same Spark session by changing the Hadoop configuration dynamically.

I have two S3 accounts, A and B, and I am running an EMR pipeline in account B that reads a CSV from a bucket in account A and writes a Parquet file to an S3 bucket in account B. I cannot add a role/bucket policy for account A to the EMR cluster, so I am using credentials to access account A and read the CSV file. I achieved this using the Hadoop configuration:

    from pyspark import SparkContext

    sc = SparkContext(appName="parquet_ingestion").getOrCreate()
    hadoop_config = sc._jsc.hadoopConfiguration()
    # temporary credentials for account A (EMRFS "fs.s3" properties)
    hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
    hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
    hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)
    hadoop_config.set("fs.s3.path.style.access", "true")
    hadoop_config.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Then I write to the S3 bucket of account B by dynamically changing the Hadoop configuration, i.e. unsetting the account A credentials:

    hadoop_config = sc._jsc.hadoopConfiguration()
    # remove account A credentials so EMRFS falls back to the EMR instance role
    hadoop_config.unset("fs.s3.awsAccessKeyId")
    hadoop_config.unset("fs.s3.awsSecretAccessKey")
    hadoop_config.unset("fs.s3.awsSessionToken")

This works fine for a limited number of files, but when I run 50 jobs at once, 4 or 5 of them fail with a 403 Access Denied error. The rest of the jobs were able to create Parquet files in the S3 bucket of account B. On analysis I found that the failing jobs failed while writing the Parquet file.

I asked the AWS support team about this Access Denied error. They said the jobs failed while trying to do a ListBucket operation on account A during the write, so the job is trying to access account A using the default EMR role (since the credentials are unset) and fails.

With this information I came to the conclusion that the Hadoop unset is not working as expected for some jobs. Ideally it should fall back to account B while writing the Parquet file once I unset the access key and other credentials.

    py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-31-31-212.ap-south-1.compute.internal, executor 23): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID:

Question: How can I overcome this situation through some configuration so that writes default to account B?

Vikram Ranabhatt

1 Answer


unset is odd in that if a reload of the default properties is triggered, the unsets are overridden. This is a PITA. If you can force-load the HDFS and YARN configs (new HdfsConfiguration(); new YarnConfiguration()) then that should trigger those extra default loads up front, though Hive may play this game too.
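
In PySpark that force-load could be attempted through the py4j gateway before unsetting the keys. A minimal sketch, assuming the HDFS and YARN configuration classes are available on the EMR driver's classpath:

    # instantiating these classes registers hdfs-*.xml / yarn-*.xml as default
    # resources, so that reload of defaults happens now rather than after the unsets
    sc._jvm.org.apache.hadoop.hdfs.HdfsConfiguration()
    sc._jvm.org.apache.hadoop.yarn.conf.YarnConfiguration()

    hadoop_config = sc._jsc.hadoopConfiguration()
    hadoop_config.unset("fs.s3.awsAccessKeyId")
    hadoop_config.unset("fs.s3.awsSecretAccessKey")
    hadoop_config.unset("fs.s3.awsSessionToken")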

If you have the S3A connector on your classpath, you can set the login details on a bucket-by-bucket basis; see "Per-bucket configuration" in the S3A documentation.
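
A sketch of that per-bucket approach, assuming hypothetical bucket names and that the account A reads go through the s3a:// scheme rather than EMRFS s3://:

    # credentials scoped to account A's bucket only; every other bucket keeps
    # the cluster's default credentials (the EMR instance role)
    hadoop_config = sc._jsc.hadoopConfiguration()
    hadoop_config.set("fs.s3a.bucket.account-a-bucket.access.key", dl_access_key)
    hadoop_config.set("fs.s3a.bucket.account-a-bucket.secret.key", dl_secret_key)
    hadoop_config.set("fs.s3a.bucket.account-a-bucket.session.token", dl_session_key)
    hadoop_config.set("fs.s3a.bucket.account-a-bucket.aws.credentials.provider",
                      "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")

    # read from account A via s3a://; the write to account B needs no keys at all
    df = spark.read.csv("s3a://account-a-bucket/input/data.csv", header=True)
    df.write.parquet("s3a://account-b-bucket/output/")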

stevel