5

Scenario: I am trying to import a large data set (a few million records) from an Oracle DB into Elasticsearch. I am able to import the data using the Logstash JDBC input plugin.

Issue: it is taking a very long time (5 hours) to complete.

  • Is there a better solution to reduce this time? What are good practices in such scenarios?
shan

3 Answers

4

You can play with jdbc_fetch_size first. The next step would be to run the import on several nodes, splitting the SQL query into non-overlapping sets for each node, e.g. select * from XX where id between 0 and 1000000.
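For example, each node could run a variation of the following (a sketch; the connection details and the id range are placeholders, and the best jdbc_fetch_size depends on your row size and available heap):

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => "/path/to/ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      # rows the JDBC driver fetches per round trip;
      # bigger means fewer round trips but more heap usage
      jdbc_fetch_size => 10000
      # this node's non-overlapping slice of the table
      statement => "select * from XX where id between 0 and 1000000"
   }
}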

You can also set the number of ES index replicas to 0 before indexing a big bulk and change it back afterwards. And of course, use the bulk indexing API with a good batch size and parallelism.
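On the output side, batch size and parallelism can be tuned on the elasticsearch output. A sketch for Logstash versions of that era (flush_size was removed in later releases of the plugin; the host and index name are placeholders):

output {
   elasticsearch {
      hosts => ["localhost:9200"]
      index => "myindex"
      # events are sent through the bulk API; flush_size is the batch size
      flush_size => 5000
      # number of output workers sending bulk requests in parallel
      workers => 4
   }
}

Before the run you can drop index.number_of_replicas to 0 through the index settings API and restore it once the bulk load finishes.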

xeye
  • Thanks xeye for answering! To increase the fetch_size I need enough heap space to hold the fetched records, so I need to figure out the largest fetch_size that won't give me an OutOfMemory exception. Can we use any queue between the DB and ES, like Kafka? – shan Jun 27 '16 at 19:06
  • Of course you can use Kafka, but Kafka alone will not improve the indexing performance. – xeye Jun 27 '16 at 22:52
  • I was planning on Kafka to avoid holding the DB connection for the entire duration of the indexing. Once the data is pushed to Kafka, we don't need the DB connection. How do we increase the indexing performance? I know we can set flush_size for Elasticsearch to make use of the bulk API. – shan Jun 28 '16 at 00:51
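A decoupled pipeline along the lines discussed in these comments could be split into two configs, one draining the database into Kafka and one indexing from Kafka into Elasticsearch (a sketch; the topic name, broker address, and exact option names vary with the kafka plugin version):

# config 1: drain Oracle into Kafka, then the DB connection can be released
input {
   jdbc { ... }
}
output {
   kafka {
      bootstrap_servers => "localhost:9092"
      topic_id => "oracle-export"
   }
}

# config 2: index from Kafka into Elasticsearch at its own pace
input {
   kafka {
      bootstrap_servers => "localhost:9092"
      topics => ["oracle-export"]
   }
}
output {
   elasticsearch {
      hosts => ["localhost:9200"]
   }
}

As xeye notes, Kafka alone won't speed up indexing, but it lets the JDBC side finish and release its connection while Elasticsearch consumes the topic at its own pace.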
2

You can use the :sql_last_value option to get data from the database incrementally with a scheduler. Something like the following will be helpful (assuming you have an ID field):

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => ".......\ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      statement => "select * from (select * from TABLE where id >= :sql_last_value ORDER BY id ASC) where rownum < 20"
      # track the last-seen id and expose it as :sql_last_value
      use_column_value => true
      tracking_column => "id"
      tracking_column_type => "numeric"
      # run every minute
      schedule => "* * * * *"
  } 

}

Update: refactored the SQL to use rownum with ORDER BY so the results are limited in the intended order (sort before limiting). See: On ROWNUM and Limiting Results.

The same result can be achieved using pagination, but it has performance issues. For example, if we use pagination:

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => ".......\ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      jdbc_page_size => 20
      jdbc_paging_enabled => true
      statement => "select * from TABLE"
  } 
}

But this will wrap "select * from TABLE" like

SELECT * FROM (SELECT "T1".*, ROWNUM "X_SEQUEL_ROW_NUMBER_X" FROM (SELECT * FROM (select * from TABLE) "T1") "T1") "T1" WHERE (("X_SEQUEL_ROW_NUMBER_X" > 0) AND ("X_SEQUEL_ROW_NUMBER_X" <= (0 + 20)));

and will run without a scheduler, dividing the query based on the specified jdbc_page_size (20 in this example). But this method has performance issues: each page re-runs the full query inside a ROWNUM window, so the deeper the pagination goes, the more rows the database has to scan and discard.

aphahn
Erangad
-1

You can put more than one jdbc input plugin in the same config file. It works fine for me.

i.e.:

input {
   jdbc {
      ...
   }

   jdbc {
      ...
   }
}
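Combined with the id-range splitting from the first answer, a concrete version might look like this (the ranges and connection details are made up for illustration):

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => "/path/to/ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      # first half of the table
      statement => "select * from TABLE where id between 0 and 500000"
   }
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => "/path/to/ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      # second half, read in parallel with the first
      statement => "select * from TABLE where id between 500001 and 1000000"
   }
}

The inputs run in separate threads, and their events all flow through the same filter and output sections.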