
What I'm trying to do:

  • read from and write to S3 buckets across multiple AWS profiles (AWS_PROFILE values)

What I have working so far:

  • AWS SSO works, and I can access different resources in Python via boto3 by changing the AWS_PROFILE environment variable (see the boto3 sketch after this list)
  • Delta Spark can read and write to S3 using Hadoop configurations (consolidated in the builder sketch after this list)
    • enable Delta tables for PySpark
      builder.config("spark.sql.extensions",
          "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      
    • allow the s3:// scheme for read/write
        "spark.hadoop.fs.s3.impl",
        "org.apache.hadoop.fs.s3a.S3AFileSystem"
      
    • use a specific credentials provider (here the EC2 instance profile) for one or more buckets
      "fs.s3a.bucket.{prod_bucket}.aws.credentials.provider",
      "com.amazonaws.auth.InstanceProfileCredentialsProvider"
      

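On the boto3 side, a minimal sketch of using several profiles side by side without touching the AWS_PROFILE environment variable (profile and bucket names here are hypothetical):

    import boto3

    # each Session pins its own profile from ~/.aws/config, so two SSO
    # profiles can be used in the same process
    dev = boto3.Session(profile_name="dev")
    prod = boto3.Session(profile_name="prod")

    # list a few keys from a bucket in each account
    for session, bucket in [(dev, "dev-bucket"), (prod, "prod-bucket")]:
        s3 = session.client("s3")
        resp = s3.list_objects_v2(Bucket=bucket, MaxKeys=5)
        for obj in resp.get("Contents", []):
            print(bucket, obj["Key"])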
Any help, suggestions, or comments appreciated. Thanks!
  • I think I can use IAM roles to span multiple profiles? I'll try it (a sketch follows below): https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/assumed_roles.html – 123 Oct 13 '22 at 20:12
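For reference, a hedged sketch of that assumed-role route from the linked Hadoop docs (the role ARN and bucket name are made up, and the base credentials must be allowed to call sts:AssumeRole on the target role):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # have s3a assume a dedicated role when talking to this bucket
        .config("spark.hadoop.fs.s3a.bucket.other-bucket.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
        .config("spark.hadoop.fs.s3a.bucket.other-bucket.assumed.role.arn",
                "arn:aws:iam::123456789012:role/s3-reader")
        .getOrCreate()
    )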

1 Answer


As of October 2022, the s3a connector doesn't support AWS SSO/identity server. Moving to the AWS SDK v2 is a prerequisite, and that migration is a work in progress.

See HADOOP-18352

– stevel
  • Thank you, Steve! I should've checked more thoroughly. Appreciate all your contributions to Hadoop! – 123 Oct 14 '22 at 16:33
  • 1
  • No worries. Maybe if you use the SSO tools on the CLI and get the credentials exported locally, you could then get the current session creds into the Spark conf (sketched below)... – stevel Oct 15 '22 at 13:57
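A sketch of that suggestion, assuming an SSO profile named "prod" that has already been through aws sso login; the exported credentials are short-lived, so the session has to be rebuilt when they expire:

    import boto3
    from pyspark.sql import SparkSession

    # pull the short-lived session credentials the SSO login produced
    creds = (boto3.Session(profile_name="prod")
             .get_credentials()
             .get_frozen_credentials())

    spark = (
        SparkSession.builder
        # hand the temporary credentials to s3a for this bucket only
        .config("spark.hadoop.fs.s3a.bucket.prod-bucket.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        .config("spark.hadoop.fs.s3a.bucket.prod-bucket.access.key", creds.access_key)
        .config("spark.hadoop.fs.s3a.bucket.prod-bucket.secret.key", creds.secret_key)
        .config("spark.hadoop.fs.s3a.bucket.prod-bucket.session.token", creds.token)
        .getOrCreate()
    )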