What I'm trying to do:
- read from and write to S3 buckets across multiple AWS_PROFILEs
resources:
- https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Configuring_different_S3_buckets_with_Per-Bucket_Configuration
  - shows how to set different credentials per bucket (a minimal sketch of this pattern follows this list)
  - shows how to use different credential providers
  - doesn't show how to use more than one AWS_PROFILE
- https://spark.apache.org/docs/latest/cloud-integration.html#authenticating
- https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sso.html
- "No FileSystem for scheme: s3" with pyspark
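For context, the per-bucket pattern from that Hadoop doc looks roughly like this when expressed as Spark conf keys (the bucket names and credentials below are placeholders, not values from my setup):

```python
from pyspark.sql import SparkSession

# Per-bucket S3A configuration, passed through to Hadoop via the
# "spark.hadoop." prefix. "landing" and "warehouse" are placeholder buckets.
spark = (
    SparkSession.builder
    # bucket "landing" uses static access keys
    .config("spark.hadoop.fs.s3a.bucket.landing.access.key", "AKIA...")
    .config("spark.hadoop.fs.s3a.bucket.landing.secret.key", "...")
    # bucket "warehouse" uses a different credential provider entirely
    .config(
        "spark.hadoop.fs.s3a.bucket.warehouse.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)
```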
What I have working so far:
- AWS SSO works, and I can access different resources in Python via boto3 by changing the AWS_PROFILE environment variable (see the boto3 sketch after this list)
- Delta Spark can read and write to S3 using Hadoop configurations (a consolidated sketch follows this list)
  - enable Delta tables for PySpark:
    builder.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension").config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  - allow the s3:// scheme for reads/writes by mapping it to the S3A filesystem:
    "spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - use the instance profile (rather than an AWS_PROFILE) as the credentials for one or more buckets:
    "fs.s3a.bucket.{prod_bucket}.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider"
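For reference, the boto3 part that already works is essentially this (the profile names and output are just placeholders for my SSO profiles):

```python
import os
import boto3

# Switching profiles via the environment variable, as described above.
# "dev-sso" and "prod-sso" are placeholder profile names from ~/.aws/config.
os.environ["AWS_PROFILE"] = "dev-sso"
dev_s3 = boto3.client("s3")
print(dev_s3.list_buckets()["Buckets"][:3])

# An explicit session is equivalent and avoids mutating the environment.
prod_s3 = boto3.Session(profile_name="prod-sso").client("s3")
print(prod_s3.list_buckets()["Buckets"][:3])
```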
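And putting the Spark pieces above together, a minimal sketch of the working setup. The bucket name, table path, and the configure_spark_with_delta_pip helper (from the delta-spark package, used here to pull in the Delta jars) are my placeholders/choices, not a definitive setup:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

prod_bucket = "my-prod-bucket"  # placeholder bucket name

builder = (
    SparkSession.builder.appName("delta-s3")
    # enable Delta tables for PySpark
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # route the s3:// scheme through the S3A filesystem
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # this bucket authenticates with the instance profile
    .config(f"spark.hadoop.fs.s3a.bucket.{prod_bucket}.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# sanity check: read a Delta table from the bucket ("tables/events" is a placeholder path)
df = spark.read.format("delta").load(f"s3://{prod_bucket}/tables/events")
df.show(5)
```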
Any help, suggestions, or comments are appreciated. Thanks!