4

I am using the following code to connect to a PostgreSQL 12 database:

con <- DBI::dbConnect(odbc::odbc(), driver = driver, server = server, database = database, uid = uid, pwd = pwd, port = port)

This connects me to a PostgreSQL 12 database on Google Cloud SQL. The following code is then used to upload data:

DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)

where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.

Uploading with the above method took approximately 9 hours, at a rate of about 40 write operations per second. Is there a faster way to upload this data into my PostgreSQL database, preferably through R?

Dylan Russell

2 Answers

9

I've always found bulk copy, run externally to R, to be the best approach. The insert itself can be significantly faster; the trade-off is (1) the extra step of writing the data to a file, against (2) a much shorter overall run-time.

Setup for this test:

  • win10 (2004)
  • docker
  • postgres:11 container, using port 35432 on localhost, simple authentication
  • a psql binary in the host OS (where R is running); this is easy on linux, and on windows I grabbed the "zip" (not the installer) from https://www.postgresql.org/download/windows/ and extracted what I needed (a quick setup/check sketch from R follows this list)
  • I'm using data.table::fwrite to save the file because it's fast; write.table and write.csv would also work and are still much faster than DBI::dbWriteTable, but with data of your size you will want something quick
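
For reference, the snippets below use a connection named con2 to that container. A minimal, hypothetical setup sketch follows; RPostgres and the exact credentials are my assumptions, mirroring the docker settings above, plus a quick check that psql is visible from R:

# hypothetical setup: con2 is the connection used in the snippets below;
# RPostgres and these credentials mirror the docker container described above
con2 <- DBI::dbConnect(
  RPostgres::Postgres(),
  host = "127.0.0.1", port = 35432,
  user = "postgres", password = "mysecretpassword",
  dbname = "postgres"
)
# sanity check: the psql client must be findable from R ("" means not found)
Sys.which("psql")
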
DBI::dbCreateTable(con2, "mt", mtcars)
DBI::dbGetQuery(con2, "select count(*) as n from mt")
#   n
# 1 0

z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
nrow(z1000)
# [1] 32000
system.time({
  DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
})
#    user  system elapsed 
#    1.56    1.09   30.90 

system.time({
  data.table::fwrite(z1000, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z1000), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})    
# COPY 32000
#    user  system elapsed 
#    0.05    0.00    0.19 
DBI::dbGetQuery(con2, "select count(*) as n from mt")
#       n
# 1 64000

While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
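
Applied to your data, a rough, untested sketch might look like the following; df and tablename are from your question, psql must be on the PATH, the target table must already exist (DBI::dbCreateTable, as you do now), and the user, password, host, port, and database name are placeholders for your Cloud SQL instance:

# hypothetical adaptation for the ~550,000-row data frame from the question;
# all connection details are placeholders for the Cloud SQL instance
data.table::fwrite(df, "df.csv")
URI <- sprintf("postgresql://%s:%s@%s:%s/%s",
               "myuser", "mypassword", "cloud-sql-host-or-ip", "5432", "mydb")
system(
  sprintf("psql -c \"\\copy %s (%s) from 'df.csv' (FORMAT CSV, HEADER)\" %s",
          tablename, paste(colnames(df), collapse = ","), URI)
)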


Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:

z100 <- data.table::rbindlist(replicate(100, mtcars, simplify = FALSE))

system.time({
  data.table::fwrite(z100, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z100), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})
# COPY 3200
#    user  system elapsed 
#     0.0     0.0     0.1 

system.time({
  DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
#    user  system elapsed 
#    0.17    0.04    2.95 

system.time({
  DBI::dbAppendTable(con2, "mt", z100)
})
#    user  system elapsed 
#    0.74    0.33   23.59 

(I don't want to time dbAppendTable with z1000 above ...)

(For kicks, I ran it with replicate(10000, ...) and reran the psql and dbWriteTable tests; they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...)

r2evans
  • This is awesome, thanks for the hard work! Will reply and let you know how it goes. – Dylan Russell Jun 06 '20 at 03:12
  • I can't seem to get this to work. I know I have psql installed. It is in my Windows path. The `system()` command should be able to get to it. I can get into my database from the command line. I can even run the command you are writing with `sprintf()` directly in the terminal and it works. But when I try your script, nothing happens. – Dylan Russell Jun 08 '20 at 08:26
  • Ah, well I think I've found at least one thing wrong - when I change the command from `system` to `shell`, I can see the terminal output and it says 'psql.exe' is not recognized as an internal or external command... – Dylan Russell Jun 08 '20 at 08:56
  • Nevermind - I had to restart R since I had updated my path to include the psql binaries. The problem I have now is the console requires interaction to input my password. But I can't do that with the `system` or `shell` command. – Dylan Russell Jun 08 '20 at 09:04
  • Okay, that's not hard to get around. If you use the `processx` package instead of `system` or `shell`, you can use `processx::run(cmd, args, env = c(PGPASSWORD="mypass"))`. The start of that is knowing that `psql` can accept its password in an env-var named `PGPASSWORD`. It's not perfectly secure in that multi-user systems allow other users to see all env-vars for each command, but perhaps you're on windows or macos and have some expectations. See https://stackoverflow.com/q/6405127/3358272 (other options are listed, too); a fuller sketch follows after this comment thread. – r2evans Jun 08 '20 at 13:37
  • This is a great answer, but it can't process a csv >2GB on Windows. – Kyouma Jul 01 '21 at 21:38
  • @r2evans would this logic work on a shinyApp that is hosted on DigitalOcean, built from a dockerfile if I install `psql` onto that image? – JJ Fantini Aug 28 '22 at 20:07
  • Yes, it should still work. – r2evans Aug 28 '22 at 20:45
  • How would i go about adding a `ON CONFLICT...` statement to the end of that query? @r2evans ` "psql -Upostgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER) ON CONFLICT (symbol_id, date) DO NOTHING \" %s", "equity_price_eod", paste(colnames(upload.df), collapse = ","), here::here("upload.csv"), URI ` – JJ Fantini Aug 31 '22 at 12:17
  • I use queries on `on conflict ...` every day, tbh. Instead of specifying the query on the command line, you can use `psql -f somefile.sql ...` where the `somefile.sql` has the full SQL query(ies) to execute. It's no different than any other query, did you try it and get an error? @JJFantini – r2evans Aug 31 '22 at 13:37
  • @r2evans, yes I got an error suggesting I could not use `ON CONFLICT` with psql's `\copy` function. I will use your suggested method should I need to run queries like this... working directly from R, so there would be some overhead in making an sql file, but I like the solution and it is straightforward! – JJ Fantini Aug 31 '22 at 16:30
  • Oh I see now, yes ... okay, sorry about that, my daily use-case is slightly different ... what I suggest is that you do a straight insert into a temporary table, and then do a follow-on query (bulk or straight-`DBI`) that inserts from that temp-table into the other table with an `on conflict` resolution expression. Again, I mis-placed what was going on under some of my many hoods there, sorry about that. – r2evans Aug 31 '22 at 18:07
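
Pulling the comment thread together, here is a hypothetical sketch of both points: psql reads its password from the PGPASSWORD environment variable (set per-call via processx, so there is no interactive prompt), and the bulk copy goes into a staging table from which a follow-on INSERT ... ON CONFLICT moves the rows through a DBI connection con. The table and column names (staging_tbl, equity_price_eod, symbol_id, date) are placeholders echoing the comments:

# hypothetical sketch based on the comments; all names are placeholders
data.table::fwrite(df, "upload.csv")
processx::run(
  "psql",
  args = c("-U", "postgres", "-h", "127.0.0.1", "-p", "35432",
           "-c", sprintf("\\copy staging_tbl (%s) from 'upload.csv' (FORMAT CSV, HEADER)",
                         paste(colnames(df), collapse = ","))),
  # psql picks up the password from PGPASSWORD; "current" keeps the rest of
  # the parent environment (PATH etc.) intact
  env = c("current", PGPASSWORD = "mysecretpassword")
)
# move rows from the staging table, skipping duplicates on the unique key
DBI::dbExecute(con, "
  insert into equity_price_eod
  select * from staging_tbl
  on conflict (symbol_id, date) do nothing
")
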
1

I suspect that dbAppendTable issues an INSERT statement per row, which can take a very long time for a large number of rows.

However, you can generate a single INSERT statement for the entire data frame using the sqlAppendTable function and run it explicitly with dbSendQuery:

res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, df, row.names=FALSE))
DBI::dbClearResult(res)

For me, this was much faster: a 30 second ingest reduced to a 0.5 second ingest.
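
With ~550,000 rows the single generated INSERT statement becomes very large; if that causes trouble, a hypothetical variant is to send it in chunks inside one transaction (the chunk size of 10,000 is arbitrary):

# hypothetical chunked variant: several smaller INSERTs inside one transaction
chunks <- split(df, ceiling(seq_len(nrow(df)) / 10000))
DBI::dbWithTransaction(con, {
  for (chunk in chunks) {
    res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, chunk, row.names = FALSE))
    DBI::dbClearResult(res)
  }
})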

Michal Charemza