
In Redshift, I run the following to unload data from a table into a file in S3:

unload('select * from table')
to 's3://bucket/unload/file_'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>'

I would like to do the same in Python. Any suggestions on how to replicate this? I saw examples using an access key and secret, but that is not an option for me; I need to use role-based credentials on a non-public bucket.

erik258
user8834780
  • Why is providing credentials not an option for you? – jarmod Jan 09 '18 at 22:20
  • @jarmod DevOps policy due to various cases like this https://www.itnews.com.au/news/viacom-left-keys-to-its-kingdom-exposed-on-aws-473708 – user8834780 Jan 09 '18 at 22:25
  • Well, you're going to need credentials, unless you plan to make the S3 bucket public. How you get the credentials, and how you secure them, is the important thing. The most common approach is to launch an EC2 instance with a minimally-permissioned IAM role and then run your app on that EC2 instance. The app doesn't need direct access to the credentials (the AWS SDK gets them for you), and the credentials are time-limited and auto-rotated. Obviously, you also have to protect the EC2 instance from attackers in the usual ways. – jarmod Jan 09 '18 at 22:58
  • But why should role based access not work, while key based access does? There should be an option to use either access method as per https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html#copy-usage_notes-access-role-based – user8834780 Jan 09 '18 at 23:09
  • I may be misunderstanding your question. They will both work. They're both credentialled. There's nothing magical about IAM roles. There's still a set of credentials in play (they're just time-limited and made available to the AWS SDKs in a different way). – jarmod Jan 09 '18 at 23:12
  • Maybe a more useful question would be, where is the Python run? I'm guessing your snippet was run through the console under your IAM user, which uses a password to generate the same kind of temporary credentials a role uses. Looking at the answer below, it seems likely that the Redshift console is probably generating the SQL user and password automatically, possibly even invisibly. – erik258 Jan 10 '18 at 14:45
  • @DanFarrell my snippet runs in Redshift just as any SQL query would – user8834780 Jan 11 '18 at 14:43
  • But where does the Python run? – erik258 Jan 11 '18 at 14:59

1 Answer


You will need two sets of credentials: IAM credentials, via an IAM role, for Redshift to access the S3 bucket, and Redshift database credentials to execute SQL commands.

Create a Python program that connects to Redshift, in a manner similar to other databases such as SQL Server, and executes your query. This program needs Redshift login credentials (username and password), not IAM credentials.

The IAM credentials for S3 are assigned as a role to Redshift so that Redshift can store the results in S3. This is the iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>' part of the Redshift query in your question.

You do not need boto3 (or boto) to access Redshift, unless you plan to interface with the Redshift management API (which manages clusters but does not query the data stored inside them).
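That said, if you want to avoid storing a database password entirely, boto3 can exchange the IAM role's credentials for a temporary Redshift login via the GetClusterCredentials API, as discussed in the comments. This is only a sketch: the cluster identifier, user, host, and region below are placeholders you would replace with your own values.

```python
from urllib.parse import quote_plus


def fetch_temp_credentials(cluster_id, db_user, region):
    """Exchange the IAM role's credentials (picked up automatically by
    boto3) for temporary database credentials via GetClusterCredentials."""
    import boto3
    client = boto3.client("redshift", region_name=region)
    return client.get_cluster_credentials(
        DbUser=db_user, ClusterIdentifier=cluster_id, AutoCreate=False)


def build_url(db_user, db_password, host, port, dbname):
    # Quote the user and password: GetClusterCredentials returns a DbUser
    # prefixed with "IAM:" and a password containing special characters,
    # both of which would otherwise break the connection URL.
    return "redshift+psycopg2://%s:%s@%s:%s/%s" % (
        quote_plus(db_user), quote_plus(db_password), host, port, dbname)


# Example usage (placeholders):
# creds = fetch_temp_credentials("my-cluster", "username", "us-east-1")
# url = build_url(creds["DbUser"], creds["DbPassword"], "host", 5439, "dbname")
```

The resulting URL can be passed to sa.create_engine() in place of the hard-coded USER/PASSWORD connection string in the example below.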

Here is an example Python program to access Redshift; credit due to Varun Verma.

There are other examples on the Internet to help you get started.

############ REQUIREMENTS ####################
# sudo apt-get install python3-pip
# sudo apt-get install libpq-dev
# pip3 install psycopg2
# pip3 install sqlalchemy
# pip3 install sqlalchemy-redshift
##############################################

import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker

#>>>>>>>> MAKE CHANGES HERE <<<<<<<<<<<<< 
DATABASE = "dbname"
USER = "username"
PASSWORD = "password"
HOST = "host"
PORT = "5439"          # default Redshift port is 5439
SCHEMA = "public"      #default is "public" 

####### connection and session creation ############## 
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER,PASSWORD,HOST,str(PORT),DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(sa.text(SetPath))
###### All Set Session created using provided schema  #######

################ write queries from here ######################
query = "unload('select * from table') to 's3://bucket/unload/file_' iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>';"
s.execute(sa.text(query))

# UNLOAD writes its output to S3 and returns no result rows.
# For SELECT queries you can fetch and print the results instead:
# all_results = s.execute(sa.text("select ...")).fetchall()

def pretty(all_results):
    for row in all_results:
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row:
            print(" ----", r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")

# pretty(all_results)


########## close session in the end ###############
s.close()
John Hanley
  • Thank you, but unfortunately this is a script to connect to Redshift via Python, not to unload from Redshift to S3 in Python, as the question asked. – user8834780 Jan 10 '18 at 02:52
  • You need to connect to Redshift first before you can execute any kind of SQL statement. I have modified the example to show your unload statement in the example. – John Hanley Jan 10 '18 at 04:35
  • I think what's missing from this is a redshift API call to https://docs.aws.amazon.com/redshift/latest/APIReference/API_GetClusterCredentials.html to generate an SQL credential. You can use credentials from an assumed role to authenticate that API request. From what I can tell there's no way to interact with the SQL database without an SQL credential. – erik258 Jan 10 '18 at 14:53
  • @DanFarrell. Good comment - I like your suggestion from a security view. Calling get_cluster_credentials will require a third set of credentials - IAM Access Keys or an IAM Role - to call the Redshift management API (these could be the same as the ones used for S3 but not recommended). This API would then return temporary credentials for logging into the Redshift cluster to execute the SQL query. These credentials would replace the ones used in my example (USER / PASSWORD). – John Hanley Jan 11 '18 at 02:06
  • @JohnHanley this worked for unload, but I am trying to run a create table statement; the query runs, but when I try to query the table it says it doesn't exist. Any idea? – user8834780 Jan 12 '18 at 18:12
  • Create a new question and include your code so that we can see what is wrong plus the exact error message. – John Hanley Jan 12 '18 at 18:29
  • @JohnHanley will do; unfortunately no error message is shown, but when I try to query the table later it says it doesn't exist – user8834780 Jan 12 '18 at 18:42
  • @JohnHanley https://stackoverflow.com/questions/48232367/redshift-create-table-not-working-via-python – user8834780 Jan 12 '18 at 19:00