I need to push a few thousand rows into Redshift multiple times a day, but I can't use bulk insert from S3 for administrative reasons. What's the fastest way?
Details:
There are 3 ways (that I can see) to insert rows into a table in Amazon Redshift from R:
1. Row-by-row insert query: each row is inserted as its own INSERT INTO ... VALUES query.
2. Multi-row insert query: same as 1, but multiple records are inserted per query. The size is constrained either by the number of rows or by Redshift's 16 MB maximum SQL query size.
3. Bulk insert (COPY) from AWS S3, DynamoDB, or EMR.
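For concreteness, the difference between 1 and 2 is just how many tuples each statement carries (the table name and values below are made up):

```sql
-- Approach 1: one statement per row
INSERT INTO my_table (a, b) VALUES (1.5, 'x');
INSERT INTO my_table (a, b) VALUES (2.7, 'y');

-- Approach 2: many rows in one statement, up to the 16 MB query limit
INSERT INTO my_table (a, b) VALUES (1.5, 'x'), (2.7, 'y'), (3.1, 'z');
```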
Each approach above is roughly an order of magnitude faster than the one before it. I'd love to use bulk insert to create or update tables, but our organization has made the security and administrative decision not to allow business users to bulk load or unload data via S3, so I can't use the redshiftTools package to upload files.
I can do number 1 above using RODBC::sqlSave, which is slow but gets the job done... eventually.
What I'd rather have is something like sqlSave that inserts multiple rows of data at once, without going over the row/size limits for Redshift. This is easy for simple data structures, but a generalized function that handles integers, characters, dates, etc. would be ideal, since I'm not doing this with only one table. So I popped open the source for sqlSave and started to roll my own function for building multi-row inserts that would chunk the data into 1000-row chunks and build and execute the query for each chunk.
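In case it helps, here is roughly the shape of what I started writing. This is a simplified sketch, not working production code: insert_chunked is my own name, and a real version would need proper handling of dates, NAs, and so on.

```r
library(RODBC)

# Sketch: build multi-row INSERT statements in chunks and run each one
# over an existing RODBC connection (`channel`).
insert_chunked <- function(channel, df, tablename, chunk_size = 1000) {
  # Quote character columns (doubling embedded single quotes); leave
  # numerics bare. Dates, NAs, etc. are NOT handled here.
  quoted <- as.data.frame(lapply(df, function(col) {
    if (is.character(col)) paste0("'", gsub("'", "''", col), "'")
    else as.character(col)
  }), stringsAsFactors = FALSE)

  # One "(v1, v2, ...)" tuple per row of the data frame
  tuples <- apply(quoted, 1, function(row)
    paste0("(", paste(row, collapse = ", "), ")"))

  cols   <- paste0('"', names(df), '"', collapse = ", ")
  chunks <- split(tuples, ceiling(seq_along(tuples) / chunk_size))

  for (chunk in chunks) {
    sql <- paste0('INSERT INTO "', tablename, '" (', cols, ') VALUES ',
                  paste(chunk, collapse = ", "))
    sqlQuery(channel, sql)
  }
}
```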
But I stopped to ask here: has this already been done? Is there a better way? I have a feeling that one of the other SQL packages for R has a function to do something like this, but when I search, all I find is other people having the same sort of problem.
Any tips?
UPDATE 1
Thanks to some prompting, I investigated the fast=TRUE switch in RODBC::sqlSave. The documentation makes it sound like this is what I'm after:
fast: logical. If false, write data a row at a time. If true, use a parametrized INSERT INTO or UPDATE query to write all the data in one operation.
So I tested it out. I created a little data frame with 10 records and 2 columns:
df <- data.frame(a=rnorm(10), b=sample(letters, 10, replace=TRUE),
stringsAsFactors = FALSE)
Then I used benchmark (from the rbenchmark package) to time the execution across 5 replications:
benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE), replications = 5)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = TRUE) 5 512.59 1 0.08 0.03 NA NA
benchmark( sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE), replications = 5)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 sqlSave(dbhandle, df, tablename = "jal_test1", append = TRUE, fast = FALSE) 5 264.37 1 0.08 0.02 NA NA
That's a little hard to read, but in summary:
- fast=TRUE took 512 seconds
- fast=FALSE took 264 seconds
With 25 records, the times go up to:
- fast=TRUE took 1208 seconds
- fast=FALSE took 604 seconds
Which makes exactly zero sense to me.
UPDATE 2
I tried the test=TRUE switch, thinking it would show me what's going on, but I can't figure out what it does at all. However, turning on verbose=TRUE helped me realize that fast=TRUE does not do what I thought it did. It appears to use parameter substitution, but it does not do one big insert: it still executes nrow(df) individual inserts:
> df <- data.frame(a=rnorm(5), b=sample(letters, 5, replace=TRUE), stringsAsFactors = FALSE)
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=FALSE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '1', -1.45261402, 'd' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '2', -0.01642518, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '3', 1.11767938, 'm' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '4', -0.63480166, 'a' )
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( '5', -0.75538702, 'k' )
> sqlSave(dbhandle, df, tablename = 'jal_test1', append=TRUE, fast=TRUE, verbose=TRUE)
Query: INSERT INTO "jal_test1" ( "rownames", "a", "b" ) VALUES ( ?,?,? )
Binding: 'rownames' DataType -9, ColSize 255
Binding: 'a' DataType 6, ColSize 17
Binding: 'b' DataType -9, ColSize 255
Parameters:
no: 1: rownames 1/***/no: 2: a -1.45261/***/no: 3: b d/***/
no: 1: rownames 2/***/no: 2: a -0.0164252/***/no: 3: b m/***/
no: 1: rownames 3/***/no: 2: a 1.11768/***/no: 3: b m/***/
no: 1: rownames 4/***/no: 2: a -0.634802/***/no: 3: b a/***/
no: 1: rownames 5/***/no: 2: a -0.755387/***/no: 3: b k/***/