This is a follow-up to a question I asked earlier; this one is specifically about how to access two S3 accounts from the same Spark session by changing the Hadoop configuration dynamically.
I have two S3 accounts, A and B, and I am running an EMR pipeline in account B that reads CSV files from account A and writes Parquet files to an S3 bucket in account B. I cannot attach a role/bucket policy for account A to the EMR cluster, so I use credentials to access account A and read the CSV files. I achieve this with the following Hadoop configuration:
sc = SparkContext(appName="parquet_ingestion").getOrCreate()
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)
hadoop_config.set("fs.s3.path.style.access", "true");
hadoop_config.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
To write to S3 in account B, I then change the Hadoop configuration dynamically by unsetting those credentials:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")
This works fine for a limited number of files, but when I run 50 jobs at once, 4 or 5 of them fail with a 403 Access Denied error; the rest create their Parquet files in the account B bucket without issue. On analysis I found that the failure happens while writing the Parquet file.
I asked the AWS support team about this Access Denied error. They said the job failed while trying to do a ListBucket operation on account A during the write, i.e. the job is trying to access account A using the default EMR role (since the credentials are unset) and is denied.
From this I concluded that the Hadoop unset is not working as expected for some jobs. Ideally, once I unset the access key and credentials, the write should fall back to account B.
The relevant part of the stack trace:

py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-31-31-212.ap-south-1.compute.internal, executor 23): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID:
Question: How can I overcome this through some configuration, so that the write reliably falls back to the default account (B)?
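Would something like per-bucket credentials be the right direction? The sketch below is only an illustration: it assumes the S3A connector (s3a://) is usable on the cluster (EMR's default is EMRFS s3://) and uses a placeholder bucket name; the per-bucket properties are the ones documented for Hadoop's S3A connector.

# Possible alternative (untested here): scope the account A credentials to
# that one bucket via S3A per-bucket configuration, so writes to the
# account B bucket never see them. "account-a-bucket" is a placeholder.
hadoop_config.set("fs.s3a.bucket.account-a-bucket.access.key", dl_access_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.secret.key", dl_secret_key)
hadoop_config.set("fs.s3a.bucket.account-a-bucket.session.token", dl_session_key)
hadoop_config.set(
    "fs.s3a.bucket.account-a-bucket.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
)
# The read would then use the s3a:// scheme instead of EMRFS s3://:
# df = spark.read.csv("s3a://account-a-bucket/path/to/input.csv", header=True)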