
I need to push a few thousand rows into Redshift multiple times a day, but I can't use a bulk insert from S3 for administrative reasons. What's the fastest way?

Details:

There are 3 ways (that I can see) to insert rows into a table in Amazon Redshift from R:

  1. Row-by-row insert: each row is inserted as its own INSERT VALUES query.
  2. Multiple-row insert: same as 1, but multiple records are inserted per query. The size is constrained either by the number of rows or by the 16 MB maximum SQL query size.
  3. Bulk insert from AWS S3, Dynamo, or EMR.

Each approach above is roughly an order of magnitude faster than the one before it. I'd love to use bulk insert to create or update tables, but our organization has made the security and administrative decision not to allow business users to bulk load or unload data via S3. That rules out using the redshiftTools package to upload files.

I can do option 1 above using RODBC::sqlSave, which is slow but gets the job done... eventually.

What I'd rather have is something like sqlSave that inserts multiple rows of data at once, but not so many that it exceeds Redshift's row/size limits. This is easy for simple data structures, but a generalized function that handles integers, characters, dates, etc. would be ideal, since I'm not doing this with only a single table. So I popped open the source for sqlSave and started to roll my own function for building multiple-row inserts that would split the data into 1000-row chunks and build and execute the query for each chunk, along the lines of the sketch below.
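To make the idea concrete, here is a minimal sketch of the kind of function I mean. It is hypothetical: it only distinguishes numeric columns from everything else (characters and dates fall back to quoted as.character() values), it does nothing about NAs, and it assumes an open RODBC channel like the dbhandle used later in this question.

library(RODBC)

# Hypothetical sketch: send one multi-row INSERT per chunk of rows.
insert_chunked <- function(channel, df, tablename, chunk_size = 1000) {
  col_list <- paste0('"', names(df), '"', collapse = ", ")

  # Quote each column by type: numerics as-is, everything else as an
  # escaped string literal.
  quoted <- lapply(df, function(col) {
    if (is.numeric(col)) as.character(col)
    else paste0("'", gsub("'", "''", as.character(col)), "'")
  })
  row_tuples <- paste0("( ", do.call(paste, c(quoted, sep = ", ")), " )")

  # Build and run one INSERT statement per chunk_size rows.
  for (s in seq(1, length(row_tuples), by = chunk_size)) {
    chunk <- row_tuples[s:min(s + chunk_size - 1, length(row_tuples))]
    sql <- paste0("INSERT INTO ", tablename, " ( ", col_list, " ) VALUES ",
                  paste(chunk, collapse = ", "))
    sqlQuery(channel, sql)
  }
  invisible(length(row_tuples))
}

Called as insert_chunked(dbhandle, df, 'jal_test1'), this is essentially option 2 from the list above.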

But I stopped to ask here: has this already been done? Is there a better way? I have a feeling that one of the other SQL packages for R has a function to do something like this, but when I search, all I find are other people having the same sort of problem.

Any tips?

UPDATE 1

Thanks to some prompting, I investigated the fast=TRUE switch in RODBC::sqlSave. The documentation makes it sound like this is what I'm after:

fast: logical. If false, write data a row at a time. If true, use a parametrized INSERT INTO or UPDATE query to write all the data in one operation.

So I figured I should test this out. I created a little data frame with 10 records and 2 columns:

df <- data.frame(a=rnorm(10), b=sample(letters, 10, replace=TRUE),
                 stringsAsFactors = FALSE)    

Then I used benchmark() from the rbenchmark package to time the execution across 5 replications:

benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE), replications = 5)
#                                                                         test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = TRUE)            5  512.59        1      0.08     0.03         NA        NA

benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE), replications = 5)
#                                                                          test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = FALSE)            5  264.37        1      0.08     0.02         NA        NA
      

That's a little hard to read, but in summary:

  • fast=TRUE took 512 seconds
  • fast=FALSE took 264 seconds

With 25 records, the times go up to:

  • fast=TRUE took 1208 seconds
  • fast=FALSE took 604 seconds

Which makes exactly zero sense to me.

UPDATE 2

I tried the test=TRUE switch thinking it would show me what's going on, but I can't figure out what that does at all... However, turning on verbose=TRUE helped me realize that fast=TRUE does not do what I thought it did: it uses a parameterized query, but it does not do one big insert. It still does nrow(df) worth of inserts:

> df <- data.frame(a=rnorm(5), b=sample(letters, 5, replace=TRUE), stringsAsFactors = FALSE)
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '1', -1.45261402, 'd' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '2', -0.01642518, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '3',  1.11767938, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '4', -0.63480166, 'a' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '5', -0.75538702, 'k' )
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( ?,?,? )
Binding: 'rownames' DataType -9, ColSize 255
Binding: 'a' DataType 6, ColSize 17
Binding: 'b' DataType -9, ColSize 255
Parameters:
no: 1: rownames 1/***/no: 2: a -1.45261/***/no: 3: b d/***/
no: 1: rownames 2/***/no: 2: a -0.0164252/***/no: 3: b m/***/
no: 1: rownames 3/***/no: 2: a 1.11768/***/no: 3: b m/***/
no: 1: rownames 4/***/no: 2: a -0.634802/***/no: 3: b a/***/
no: 1: rownames 5/***/no: 2: a -0.755387/***/no: 3: b k/***/
JD Long
3 Answers


I was ultimately unable to find an R implementation of an SQL write function that does chunking. But I did see that in Python the sqlalchemy package, teamed with pandas, could easily do this. So I whipped out reticulate and wrapped the Python calls in some R code to create a function that writes to Redshift. It seems like overkill, but it gets the job done without me having to reimplement anything:

start_python <- function(){
  library(reticulate)
  use_condaenv("r-reticulate")
  # assign the module handles outside the function's own scope so that
  # write_to_redshift() below can find them
  pd <<- import('pandas')
  sa <<- import('sqlalchemy')
}

# write a table to RDW sandbox
write_to_redshift <- function(df, tablename, if_exists = 'append'){
  pd_df <- r_to_py(df)
  eng = sa$create_engine('postgres://user:pwd@redshift_name:5439/db_name')
  conn = eng$connect()
  write_result <- pd_df$to_sql( name = tablename
      , con = conn
      , index = FALSE
      , if_exists = if_exists
      , schema = 'my_schema'
      , chunksize = 10000L)
  conn$close()
  return(write_result)
}
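Usage looks something like this (the data frame and table name here are illustrative; the connection string and schema above need to be filled in for your own cluster):

start_python()

df <- data.frame(a = rnorm(5000),
                 b = sample(letters, 5000, replace = TRUE),
                 stringsAsFactors = FALSE)

write_to_redshift(df, tablename = 'jal_test1')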
JD Long
  • For the code line eng = sa$create_engine('postgres://user:pwd@redshift_name:5439/db_name'), postgres:// now needs to change to postgresql:// because SQLAlchemy used to accept both, but has removed support for the postgres name. – Skurup Oct 26 '21 at 06:16
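In other words, with a current SQLAlchemy the engine line above would become (placeholders unchanged):

eng = sa$create_engine('postgresql://user:pwd@redshift_name:5439/db_name')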

I found this answer while trying to troubleshoot why it was taking a long time to write out a ~5000-row data frame from pandas (imported into R via reticulate), and wanted to post an update.

As of pandas version 0.24.0 (January 2019) there is a method argument in pd.DataFrame.to_sql that seems to actually do reasonable chunking when you set method = 'multi'. Comparing the two ways of writing the data frame out to Redshift below, I see roughly a 50x speedup.

# Called from R, takes ~500 seconds for a 5000 row data frame
write_result <- pd_df$to_sql( name=tablename
    , con=conn
    , index = FALSE
    , if_exists = if_exists
    , schema='my_schema'
    , chunksize=1000L)

# Takes ~10 seconds for the same 5000 row data frame.
write_result <- pd_df$to_sql( name=tablename
    , con=conn
    , index = FALSE
    , if_exists = if_exists
    , schema='my_schema'
    , chunksize=1000L
    , method='multi') # this is the new argument
Moderat

I recommend doing a bulk load with one of the non-S3 COPY options:

  1. COPY from Amazon EMR
  2. COPY from Remote Host (SSH)
  3. COPY from Amazon DynamoDB

https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-source.html
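For reference, issuing a COPY from EMR through the ODBC connection already in use would look roughly like this; the cluster ID, output path, and IAM role ARN are placeholders, and the Redshift cluster needs permission to read from the EMR cluster's HDFS:

# Sketch only: cluster ID, path, and role ARN below are placeholders.
copy_sql <- "
  COPY my_schema.jal_test1
  FROM 'emr://j-EXAMPLECLUSTERID/myoutput/part-*'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
"
RODBC::sqlQuery(dbhandle, copy_sql)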

Joe Harris
  • Looking at the documentation, it looks like all of these options require S3 access and the associated IAM (Identity and Access Management) credentials. Our current security approach makes that impossible, as we give no IAM access to any account other than the ETL process that populates Redshift. So I don't think I have the credentials to do any of these options. I think I'm back to using SQL for updates via `INSERT VALUES`... – JD Long Jun 01 '18 at 19:22