
In Amazon Redshift's Getting Started Guide, data is pulled from Amazon S3 and loaded into an Amazon Redshift cluster using SQLWorkbench/J. I'd like to mimic the same process of connecting to the cluster and loading sample data into it using Boto3.

However, in Boto3's documentation for Redshift, I'm unable to find a method that would let me upload data into an Amazon Redshift cluster.

I've been able to connect with Redshift utilizing Boto3 with the following code:

client = boto3.client('redshift')

But I'm not sure what method would allow me to either create tables or upload data to Amazon Redshift the way it's done in the tutorial with SQLWorkbenchJ.

– Chris
  • fwiw aws sdk for pandas (awswrangler) has suitable methods, for example [copy_from_files()](https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.redshift.copy_from_files.html#awswrangler.redshift.copy_from_files) – 00schneider Dec 16 '22 at 16:19
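
For reference, here is a minimal sketch of the awswrangler approach mentioned in the comment above. Everything below (cluster identifier, user, database, S3 path, table, and IAM role) is a placeholder, and the S3 files are assumed to be Parquet:

import awswrangler as wr

# open a connection with temporary credentials (GetClusterCredentials under the hood)
con = wr.redshift.connect_temp(
    cluster_identifier="examplecluster",  # placeholder
    user="awsuser",                       # placeholder
    database="dev",                       # placeholder
)

# COPY Parquet files from S3 into the target table
wr.redshift.copy_from_files(
    path="s3://my-bucket/sample-data/",                        # placeholder
    con=con,
    table="venue_example",                                     # placeholder
    schema="public",
    iam_role="arn:aws:iam::123456789012:role/myRedshiftRole",  # placeholder; must allow S3 read
)

con.close()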

3 Answers


Right, you need the psycopg2 Python module to execute the COPY command.

My code looks like this:

import psycopg2

# Amazon Redshift connect string
conn_string = "dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"

# connect to Redshift (database should be open to the world)
con = psycopg2.connect(conn_string)

# COPY loads the file straight from S3 into the target table
sql = """COPY %s FROM '%s' credentials
      'aws_access_key_id=%s;aws_secret_access_key=%s'
      delimiter '%s' FORMAT CSV %s %s; commit;""" % (
      to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, delim, quote, gzip)

# Here
#   fn   - s3://path_to__input_file.gz
#   gzip = 'gzip'

cur = con.cursor()
cur.execute(sql)
con.close()

I used boto3/psycopg2 to write CSV_Loader_For_Redshift

– Alex B

Go back to step 4 in that tutorial you linked. See where it shows you how to get the URL of the cluster? You have to connect to that URL with a PostgreSQL driver. The AWS SDKs such as Boto3 provide access to the AWS API. You need to connect to Redshift over a PostgreSQL API, just like you would connect to a PostgreSQL database on RDS.
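
For illustration, a minimal sketch of that flow: use Boto3 only to look up the cluster endpoint, then open a regular PostgreSQL connection with psycopg2 (cluster identifier, credentials, and table below are placeholders):

import boto3
import psycopg2

# Boto3 talks to the AWS control plane, e.g. to find out where the cluster lives
redshift = boto3.client('redshift')
endpoint = redshift.describe_clusters(
    ClusterIdentifier='examplecluster'          # placeholder
)['Clusters'][0]['Endpoint']

# Tables and data are managed over an ordinary PostgreSQL connection
conn = psycopg2.connect(
    host=endpoint['Address'],
    port=endpoint['Port'],
    user='masteruser',                          # placeholder
    password='***',
    dbname='dev'                                # placeholder
)

cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS venue_example (venueid INT, venuename VARCHAR(100));")
conn.commit()
conn.close()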

– Mark B
  • Thanks! I'm now trying to figure out how to utilize SQLAlchemy instead of SQLWorkbenchJ, which I'm having some difficulty with (I may edit this question or create a new one). – Chris Jan 25 '16 at 17:27
  • If you have a new question, create a new question. Please don't edit an existing question to ask something new. – Mark B Jan 25 '16 at 17:39
  • Use Aginity Workbench for Redshift for connecting to the cluster. It has much better UI and features, and is recommended by the AWS folks as well. – Paladin Jan 27 '16 at 10:21

Using psycopg2 & get_cluster_credentials

Prerequisites -

  • IAM role attached to the respective user, with the get_cluster_credentials policy

  • Or on cloud (EC2) with an appropriate IAM role attached

The code below will work only if you deploy it on a PC/VM where a user's AWS credentials are already configured [ CLI - aws configure ], or if you are on an instance in the same account/VPC.

  1. Have a config.ini file -

     [Redshift]
     port = 3389
     username = please_enter_username
     database_name = please_enter_database_name
     cluster_id = please_enter_cluster_id_name
     url = please_enter_cluster_endpoint_url
     region = us-west-2

  2. My Redshift_connection.py

     import logging
     import configparser

     import boto3
     import psycopg2


     def db_connection():
         logger = logging.getLogger(__name__)

         # read cluster settings from config.ini
         parser = configparser.ConfigParser()
         parser.read('config.ini')

         RS_PORT = parser.get('Redshift', 'port')
         RS_USER = parser.get('Redshift', 'username')
         DATABASE = parser.get('Redshift', 'database_name')
         CLUSTER_ID = parser.get('Redshift', 'cluster_id')
         RS_HOST = parser.get('Redshift', 'url')
         REGION_NAME = parser.get('Redshift', 'region')

         # ask Redshift for temporary credentials for this database user
         client = boto3.client('redshift', region_name=REGION_NAME)
         cluster_creds = client.get_cluster_credentials(DbUser=RS_USER,
                                                        DbName=DATABASE,
                                                        ClusterIdentifier=CLUSTER_ID,
                                                        AutoCreate=False)

         try:
             conn = psycopg2.connect(
                 host=RS_HOST,
                 port=RS_PORT,
                 user=cluster_creds['DbUser'],
                 password=cluster_creds['DbPassword'],
                 database=DATABASE
             )
             return conn
         except psycopg2.Error:
             logger.exception('Failed to open database connection.')
             print("Failed")

    
  3. Query execution script -

     from Redshift_connection import db_connection

     def executescript(redshift_cursor):
         query = "SELECT * FROM <SCHEMA_NAME>.<TABLENAME>"
         cur = redshift_cursor
         cur.execute(query)

     conn = db_connection()
     conn.set_session(autocommit=False)
     cursor = conn.cursor()
     executescript(cursor)
     conn.close()
    
– Golokesh Patra
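
Coming back to the original question of creating tables and loading data: once a connection is open, both are plain SQL statements. Below is a minimal sketch that reuses the db_connection() helper from the answer above; the table name, S3 path, and IAM role ARN are placeholders, and the role must allow Redshift to read from that bucket.

from Redshift_connection import db_connection

conn = db_connection()
cur = conn.cursor()

# create the target table (placeholder columns)
cur.execute("""
    CREATE TABLE IF NOT EXISTS public.venue_example (
        venueid   INT,
        venuename VARCHAR(100),
        venuecity VARCHAR(30)
    );
""")

# load sample data from S3 with COPY, authenticating through an IAM role
cur.execute("""
    COPY public.venue_example
    FROM 's3://my-bucket/sample/venue.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT CSV;
""")

conn.commit()
conn.close()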