
I need to connect Spark to my Redshift instance to generate data. I am using Spark 1.6 with Scala 2.10, along with a compatible JDBC driver and the spark-redshift connector. But I am facing a weird problem. I am using PySpark:

    # Read from Redshift; the connector unloads the query results to tempdir on S3.
    df = sqlContext.read \
        .format("com.databricks.spark.redshift") \
        .option("query", "select top 10 * from fact_table") \
        .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
        .option("tempdir", "s3a://redshift-archive/") \
        .load()

When I call df.show(), it fails with a permission-denied error on my bucket. This is weird because I can see files being created in the bucket, but they cannot be read.

P.S. I have also set the access key and secret access key.

P.S. I am also confused between the s3a and s3n file systems. Connector used: https://github.com/databricks/spark-redshift/tree/branch-1.x
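
For reference, here is how I am setting the keys (a minimal sketch; the key values are placeholders). As far as I understand, s3a and s3n read different Hadoop property names, so the pair has to match the scheme used in tempdir:

    # Hadoop configuration backing the S3 filesystem; credentials are placeholders.
    # The s3a filesystem reads fs.s3a.* properties:
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "MY_ACCESS_KEY")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "MY_SECRET_KEY")
    # The older s3n filesystem reads differently named properties:
    # sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY_ACCESS_KEY")
    # sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY_SECRET_KEY")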


1 Answer


It seems the permissions are not set for Redshift to access the S3 files. Please follow the steps below (a sketch of the connector call follows the list):

  1. Add a bucket policy to that bucket that allows the Redshift account access.
  2. Create an IAM role in the Redshift account that Redshift can assume.
  3. Grant permissions to access the S3 bucket to the newly created role.
  4. Associate the role with the Redshift cluster.
  5. Run COPY statements.
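
For example, newer versions of the connector accept an aws_iam_role option so that Redshift uses the attached role for its COPY/UNLOAD against the tempdir. A minimal sketch (the role ARN is a placeholder, and the option may not be available on the 1.x branch):

    # Pass the ARN of the IAM role attached to the cluster (placeholder ARN).
    df = sqlContext.read \
        .format("com.databricks.spark.redshift") \
        .option("query", "select top 10 * from fact_table") \
        .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
        .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-role") \
        .option("tempdir", "s3a://redshift-archive/") \
        .load()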
  • The above steps are perfect for solving the permission issue on the spark-redshift connector, but in my case the problem was with the Spark version. Spark 1.6 gave me the error, but the same code works fine in Spark 2.2. – Aldrin Machado Jun 20 '19 at 05:26