
I am trying to write a large dataset (10 columns, 100M records) from R to SAP HANA using RJDBC's dbWriteTable in the following way:

library("RJDBC")
drv <- JDBC("com.sap.db.jdbc.Driver", "/data/hdbclient/ngdbc.jar", "'")
database <- dbConnect( drv,"jdbc:sap://servername", "USER", "PASS")

dbWriteTable(database, "largeSet", largeSet)

This works, but is extremely slow (about 75k records per HOUR). I have tested RODBC (sqlSave) as well, and it shows the same issue.

Looking at the code behind dbWriteTable, it appears the data is written record by record (i.e. equivalent to a series of INSERT INTO statements), and indeed a line-by-line insert using dbSendUpdate shows the same performance. I have verified that the problem is not the connection speed itself.
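
The line-by-line variant I tried looks roughly like this (illustrative sketch, abbreviated to three of the ten columns; each dbSendUpdate call is one INSERT and hence one round trip):

# one INSERT per record -- this is the bottleneck
for (i in seq_len(nrow(largeSet))) {
  dbSendUpdate(database,
               "INSERT INTO largeSet VALUES (?, ?, ?)",
               largeSet[i, 1], largeSet[i, 2], largeSet[i, 3])
}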

ROracle has a bulk_write option which seems to solve this issue, but since I am trying to write to HANA I need RJDBC or RODBC.

Can anyone tell me how I can speed up the write to HANA by running a bulk write or some other method?

  • The `RMySQL` package uses the `LOAD DATA INFILE` command to mass-insert rows into the table. This does not seem to be possible with `RJDBC`, but you can always export your data to CSV and load that into HANA manually, or write a function to do all these steps for you (a sketch of this route follows these comments). – daroczig Aug 21 '15 at 21:41
  • @daroczig Yes, so I'm essentially looking for a way to make RJDBC use such a command (perhaps by writing a wrapper based on RMySQL or Rsqlite). The CSV export/HANA import is something I want to avoid, because it is an unnecessary additional step and performance will be an issue (I need to write 3 billion records) – Michiel Aug 22 '15 at 05:38
  • I am pretty sure that you won't have much luck with `JDBC` for inserting large amounts of data. See e.g. these benchmarks: https://github.com/szilard/benchm-R-mysql The CSV way will be a lot faster: I've been doing that with other databases -- happy to share some more details on the approach and the script later, which might be reused for this purpose, if interested. – daroczig Aug 22 '15 at 08:09
  • @daroczig I have found a possible solution here: https://developer.teradata.com/blog/ulrich/2013/11/a-wider-test-case-on-r-jdbc-fastload but it seems that the setString and setInt commands are Teradata-specific, as I'm getting errors trying this on HANA. Any suggestions? – Michiel Aug 24 '15 at 11:37
  • I can't comment on the performance of the Teradata blog solution, but if you really care about performance, there is no way around CSV (or binary) imports. The Teradata solution uses prepared statements, which is an improvement, but it will still suffer a significant performance hit. HANA with its columnar data storage just doesn't like single inserts. – Bouncner Dec 23 '15 at 15:11
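
For completeness, here is a minimal sketch of the CSV route suggested in the comments above. It assumes data.table is available for fast CSV writing and that the file path is readable by the HANA server process; the path and table name are placeholders, and the IMPORT options follow HANA's documented CSV import syntax:

library(data.table)

# Dump the data frame to CSV; fwrite is much faster than write.csv at this scale.
fwrite(largeSet, "/data/staging/largeSet.csv")

# Server-side bulk load in a single statement. Note the doubled backslash so
# that HANA receives the literal '\n' as the record delimiter.
dbSendUpdate(database, paste0(
  "IMPORT FROM CSV FILE '/data/staging/largeSet.csv' INTO LARGESET ",
  "WITH RECORD DELIMITED BY '\\n' FIELD DELIMITED BY ',' ",
  "THREADS 4 BATCH 10000"))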

1 Answer


If your main goal is a speed-up without changing too much else, you could switch to the sjdbc package, which is a lot more performant in this respect than RJDBC (which sadly hasn't received much attention in recent years).

While writing this and checking back on CRAN, it looks like Simon has recently returned to RJDBC and published a new release only a week ago. It does in fact include an improvement to dbSendUpdate:

https://cran.r-project.org/web/packages/RJDBC/NEWS
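
Based on that NEWS entry, dbSendUpdate in the new release accepts vectorized parameters and executes them as a single JDBC batch, so the insert no longer happens one record at a time. A minimal sketch (untested against HANA, abbreviated to three columns, with an arbitrary chunk size):

# send the data in chunks of 100k rows, one JDBC batch per chunk
chunk <- 1e5
for (start in seq(1, nrow(largeSet), by = chunk)) {
  rows <- start:min(start + chunk - 1, nrow(largeSet))
  dbSendUpdate(database,
               "INSERT INTO largeSet VALUES (?, ?, ?)",
               largeSet[rows, 1], largeSet[rows, 2], largeSet[rows, 3])
}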
