I'm experimenting with how to connect R to Amazon Redshift - and publishing a short blog post for other newbies.
Some good progress - I'm able to do most things (create tables, select data, and even sqlSave or dbSendQuery row by row). HOWEVER, I have not found a way to do a BULK UPLOAD of a table in one shot (e.g. copy the whole 150 x 5 iris table/data frame to Redshift) that doesn't take more than a minute.
Question: any advice for someone new to RPostgreSQL on how to write/upload a block of data to Redshift would be greatly appreciated!
RODBC
colnames(iris) <- tolower(colnames(iris))  # lower-case the column names to match Redshift's unquoted, lower-case identifiers
sqlSave(channel, iris, "iris", rownames = FALSE)  # writes the data frame, but row by row
SLOOOOW! SO SLOW! Must be a better way - 150 rows take about 1.5 minutes.
iris_results <- sqlQuery(channel, "select * from iris where species = 'virginica'") # fast subset - this does work and shows up on the AWS Redshift dashboard
sqlDrop(channel, "iris", errors = FALSE) # clean up our toys
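For completeness, the RODBC calls above assume a channel that was already opened - a minimal sketch, where the DSN name and credentials are placeholders:

library(RODBC)
# 'my_redshift_dsn' is a placeholder for an ODBC DSN pointing at the Redshift cluster
channel <- odbcConnect("my_redshift_dsn", uid = "my_user", pwd = "my_password")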
RPostgreSQL
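The RPostgreSQL calls below assume a connection opened roughly like this - the host, database name, and credentials are placeholders (Redshift listens on port 5439 by default):

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host     = "my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
                 port     = 5439,          # Redshift's default port
                 dbname   = "mydb",
                 user     = "my_user",
                 password = "my_password")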
dbSendQuery(con, "create table iris_200 (sepallength float,sepalwidth float,petallength float,petalwidth float,species VARCHAR(100));")
dbListFields(con,"iris_200")
ONE BY ONE - insert a few rows into the table:
dbSendQuery(con, "insert into iris_200 values(5.1,3.5,1.4,0.2,'Iris-setosa');")
dbSendQuery(con, "insert into iris_200 values(5.5,2.5,1.1,0.4,'Iris-setosa');")
dbSendQuery(con, "insert into iris_200 values(5.2,3.3,1.2,0.3,'Iris-setosa');")
dframe <-dbReadTable(con,"iris_200") # ok
dbRemoveTable(con,"iris_200") # and clean up toys
Or loop through the whole table (takes about 1 row per second; a batched variant is sketched after the loop):
for (i in 1:(dim(iris_200)[1])) {   # iris_200 is the local data frame being loaded
  query <- paste("insert into iris_200 values(", iris_200[i, 1], ",", iris_200[i, 2], ",",
                 iris_200[i, 3], ",", iris_200[i, 4], ",", "'", iris_200[i, 5], "'", ");", sep = "")
  print(paste("row", i, "loading data >> ", query))
  dbSendQuery(con, query)  # one round trip per row - this is why it is so slow
}
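One partial workaround (still not a true bulk load, but far fewer round trips) would be to batch all the rows into a single multi-row INSERT - a rough, untested sketch, assuming iris_200 is the local data frame from above:

# build one VALUES tuple per row, then send a single INSERT for the whole data frame
rows <- sprintf("(%s,%s,%s,%s,'%s')",
                iris_200[, 1], iris_200[, 2], iris_200[, 3], iris_200[, 4], iris_200[, 5])
query <- paste0("insert into iris_200 values ", paste(rows, collapse = ","), ";")
dbSendQuery(con, query)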
So briefly, this is the hacky/slow way - any advice on how to upload/insert bulk data would be appreciated - thanks!
Full code here:
- RPostgreSQL: https://dreamtolearn.com/node/8OVIWY2C5RLZZSYD9OHCKM8AI/92
- GITHUB - rustyoldrake/RODBC_AWS_Redshift
PS - got this error message: LOAD source is not supported. (Hint: only S3 or DynamoDB or EMR based load is allowed)
Update 6/12/2015 - A direct load of bulk data at reasonable speed may not be possible, per the error message above and the LOADING DATA section of http://dailytechnology.net/2013/08/03/redshift-what-you-need-to-know/
It notes:
So now that we've created our data structure, how do we get data into it? You have two choices: 1) Amazon S3 2) Amazon DynamoDB. Yes, you could simply run a series of INSERT statements, but that is going to be painfully slow. (!)
Amazon recommends using the S3 method, which I will describe briefly. I don’t see the DynamoDB as particularly useful unless you’re already using that and want to migrate some of your data to Redshift.
To get the data from your local network to S3.....
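If I do get the S3 route working, I expect it to look roughly like this from R - an untested sketch; the bucket name, file names, and credentials are placeholders, and the aws.s3 package is just one option for the upload (the AWS CLI would work as well):

library(aws.s3)

# 1. dump the data frame to a local CSV
write.csv(iris, "iris.csv", row.names = FALSE)

# 2. push the file to an S3 bucket (bucket name is a placeholder)
put_object(file = "iris.csv", object = "iris.csv", bucket = "my-redshift-staging")

# 3. have Redshift COPY it in from S3 (the target table must already exist; credentials are placeholders)
dbSendQuery(con, "
  copy iris
  from 's3://my-redshift-staging/iris.csv'
  credentials 'aws_access_key_id=MY_KEY;aws_secret_access_key=MY_SECRET'
  csv
  ignoreheader 1;
")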
RA: Will post updates if I figure this out