We are trying to insert a dataframe into an Oracle database using an RODBC connection and the sqlQuery function. We are also using the mclapply function to achieve parallelism. Insertion performance is very slow: it takes 6 hours to insert 3.6 million records. No logs are created on the database side.

Please let us know how to improve insertion performance in R, and whether any more information is required.

library(RODBC)      # odbcConnect, sqlQuery
library(parallel)   # mclapply

Target1Conn <- odbcConnect('TARGET_DATABASE', 'USERNAME', 'PASSWORD')

if (nrow(InputData) > 0) {
    # build one INSERT ... VALUES statement per row
    sqls <- sprintf(paste0('insert into ', outputTableName,
                           ' (col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,',
                           'col11,col12,col13,col14,col15) values (%s)'),
                    apply(InputData, 1, function(i) paste(i, collapse = ",")))
}

# run the inserts in parallel across 4 cores
mclapply(sqls, function(s) {
    sqlQuery(Target1Conn, s)
}, mc.cores = 4)
  • Did you try `sqlSave` instead of looping through rows: `sqlSave(Target1Conn, InputData, outputTableName, fast=TRUE, append=TRUE, rownames=FALSE)`? – Parfait Dec 09 '19 at 21:29
  • Hi Parfait, we have tried sqlSave but it creates the table with default data types, not the specified ones. Moreover, it takes the same time to load the data. – santhosh.challa Dec 10 '19 at 04:35
  • Can you try the `odbc` package which follows R's DBI standard (that includes RJDBC, ROracle, RMySQL, RSQLite) and use `dbWriteTable`? – Parfait Dec 10 '19 at 04:57
  • We are facing some issues with the odbc package and are trying to resolve them. Once resolved, we will implement this with odbc and test the performance. – santhosh.challa Dec 10 '19 at 08:55

1 Answer

In Oracle, appending millions of records with INSERT ... VALUES is suboptimal compared to a single INSERT ... SELECT, as well-known blogs stress:

dba-oracle.com:

Don't use standard SQL inserts - They are far slower than other approaches.

asktom.oracle.com:

The fastest would be to disable the indexes (mark them unusable) and do this in a SINGLE insert

stackoverflow.com @Lokesh:

A select insert is the fastest approach as everything stays in RAM.


Therefore, consider building a staging (temp) table with RODBC's sqlSave or DBI's dbWriteTable, which also helps align variable types. See this SO answer by @Linda.

RODBC::sqlSave(Target1Conn, InputData, "myTempTable", fast=TRUE, append=FALSE, rownames=FALSE)

DBI::dbWriteTable(Target1Conn, "myTempTable", InputData, append=FALSE, row.names=FALSE)

Then, run a single INSERT ... SELECT (and try it with Oracle's APPEND hint, as shown below):

sql <- "INSERT /*+ append */ INTO myFinalTable (col1, col2, col3, col4, col5,
                                  col6, col7, col8, col9, col10,
                                  col11, col12, col13, col14, col15)
        SELECT col1, col2, col3, col4, col5,
               col6, col7, col8, col9, col10,
               col11, col12, col13, col14, col15
        FROM myTempTable"


RODBC::sqlQuery(Target1Conn, sql)

DBI::dbExecute(Target1Conn, sql)

Further optimization techniques follow; be sure to discuss them with your DBA before proceeding.

  • Use BULK COLLECT (if available) inside a stored procedure, still using the temp table above.

    Oracle (run only once)

    CREATE OR REPLACE PROCEDURE load_data
    IS
      TYPE TObjectTable IS TABLE OF myTempTable%ROWTYPE;
      ObjectTable$ TObjectTable;
    BEGIN
      SELECT * BULK COLLECT INTO ObjectTable$
        FROM myTempTable;

      FORALL x IN ObjectTable$.First .. ObjectTable$.Last
        INSERT INTO myFinalTable VALUES ObjectTable$(x);
    END;
    

    R

    RODBC::sqlQuery(Target1Conn, "BEGIN load_data; END;")
    DBI::dbExecute(Target1Conn, "BEGIN load_data; END;")
    
  • Have R export a csv or other delimited text file

    utils::write.csv(InputData, "/output/path/inputdata.csv", row.names=FALSE)  # row.names=FALSE avoids an extra leading column
    # data.table::fwrite(InputData, "/output/path/inputdata.csv")  # ALTERNATIVE
    

    Then use Oracle's SQL*Loader, which is designed for bulk inserts, as advised by @Ben.

    1. Save the following as a .ctl text file, such as myload.ctl

      OPTIONS (SKIP=1)
      LOAD DATA
        INFILE "/output/path/inputdata.csv"
        INTO TABLE "myschema"."myFinalTable"
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        (col1 INTEGER EXTERNAL,
         col2 DATE 'YYYY-MM-DD',
         col3 CHAR(50),
         col4 DECIMAL EXTERNAL,
         col5 CHAR(255),
         ...)
      
    2. Run the sqlldr command line (errors and rejected rows are written to the .log and .bad files in the same directory)

      sqlldr CONTROL=myload.ctl, LOG=myload.log, BAD=myload.bad USERID=user/pwd
      

    Of course, R can automatically create the above .ctl with writeLines() and call sqlldr at the command line with system(), for example:
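
    A minimal sketch (the paths, credentials, and column list are placeholders carried over from above; adjust to your environment):

    ctl_lines <- c(
        "OPTIONS (SKIP=1)",
        "LOAD DATA",
        "  INFILE \"/output/path/inputdata.csv\"",
        "  INTO TABLE \"myschema\".\"myFinalTable\"",
        "  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'",
        "  (col1 INTEGER EXTERNAL,",
        "   col2 DATE 'YYYY-MM-DD',",
        "   col3 CHAR(50))"
    )
    writeLines(ctl_lines, "/output/path/myload.ctl")

    # shell out to sqlldr; a zero exit status indicates success
    status <- system("sqlldr CONTROL=/output/path/myload.ctl LOG=/output/path/myload.log BAD=/output/path/myload.bad USERID=user/pwd")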

  • Temporarily disable indexes, load the new data using the temp table above, and then re-enable indexes (a sketch follows below). See the procedure here. This may also be necessary for SQL*Loader.
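
    A minimal sketch via RODBC, assuming a single index with the hypothetical name myFinalTable_idx:

    # mark the index unusable before the bulk load
    RODBC::sqlQuery(Target1Conn, "ALTER INDEX myFinalTable_idx UNUSABLE")

    # ... run the bulk load (INSERT /*+ append */ ... SELECT or SQL*Loader) ...

    # rebuild the index afterwards
    RODBC::sqlQuery(Target1Conn, "ALTER INDEX myFinalTable_idx REBUILD")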

Parfait