
I am starting an Elasticsearch 5 project with data that currently lives in SQL Server, so I am starting from scratch:

I am thinking about how to import data from my SQL Server, and especially how to keep my data synchronized when rows are updated or added.

I saw here that it is advised not to run batches too frequently.

But how do I build these synchronization batches? Do I have to write them myself, or are there widely used tools and practices? Rivers and the JDBC feeder plugin appear to have been popular, but they don't work with Elasticsearch 5.x.

Any help would be very welcome.

lambdaDev
  • This answer might help: http://stackoverflow.com/questions/34477095/elasticsearch-replication-of-other-system-data/34477639#34477639 – Val Jan 03 '17 at 13:21

1 Answer


I'd recommend using Logstash:

  • It's easy to use and set up
  • You can do your own ETL in Logstash configuration files
  • You can have multiple JDBC sources in one file
  • You'll have to figure out how to make incremental (batched) updates to sync your data. It really depends on your data model (a minimal pipeline sketch follows below).

This is a nice blog piece to begin with:

https://www.elastic.co/blog/logstash-jdbc-input-plugin
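
To illustrate the points above, here is a minimal pipeline sketch rather than a definitive setup: the `Products` table, its `id` and `last_modified` columns, the connection details, driver path and index name are all hypothetical placeholders, so adjust the SQL and credentials to your own schema.

```
input {
  jdbc {
    # Hypothetical SQL Server connection details -- replace with your own
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=MyDb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    jdbc_driver_library => "/path/to/mssql-jdbc.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"

    # Run every minute and only pull rows changed since the last run;
    # :sql_last_value holds the tracking column value stored after each run
    schedule => "* * * * *"
    statement => "SELECT id, name, price, last_modified FROM Products WHERE last_modified > :sql_last_value"
    use_column_value => true
    tracking_column => "last_modified"
    tracking_column_type => "timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "products"
    # Reusing the SQL primary key as the document id makes each scheduled
    # run update existing documents instead of creating duplicates
    document_id => "%{id}"
  }
}
```

You can add more `jdbc { ... }` blocks to the same `input` section to pull from several tables or databases in one file.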

Evaldas Buinauskas
  • One thing you need to consider is that if you delete data from your SQL server, that won't be picked up by Logstash `jdbc`. See my comment above for an alternate solution. – Val Jan 03 '17 at 13:22
  • @Val You're right, however OP is using MSSQL, so the tools mentioned in your brilliant answer will not help. Talking about deletes - that's why I mentioned that it depends on the data model. You can rebuild the index and drop the old ones, or you can capture DML operations in MSSQL using triggers and put them in Service Broker queues or some other tables to know what happened to that data. Thanks for your highlights though :) – Evaldas Buinauskas Jan 03 '17 at 13:48
  • Sure, just drawing attention to the fact that DELETEs are often left out of sight, and when the time comes to take them into account, the whole thing falls apart just because of that. – Val Jan 03 '17 at 13:50
  • @Val Interesting topic. Fortunately I have to search historical data and nothing will be deleted (I guess, I hope ...), so I won't worry about that concern. I'll keep it in mind even so – lambdaDev Jan 03 '17 at 13:51
  • Just a warning for Logstash - if, like me, you have complex documents with arrays, nested objects etc. (built using joins in the DB), Logstash cannot handle this; the jdbc importer can, though. They're currently working on version 5, as far as I'm aware - and I'm eagerly awaiting it! – dwjv Jan 03 '17 at 16:08
  • @dwjv that's not true. With some SQL magic I can import arrays and objects quite easily. For instance, arrays can be imported by creating a comma-separated list and then splitting it in Logstash using the mutate filter. Nested documents can be formatted as XML and transformed using the xml filter (see the filter sketch after these comments). – Evaldas Buinauskas Jan 03 '17 at 16:11
  • @EvaldasBuinauskas Interesting. I guess this would be dependent on your source data being stored as XML though (for the nested documents). – dwjv Jan 03 '17 at 16:17
  • @dwjv not really, I just wrap things up in a stored procedure, which builds nested documents as XML from the actual tables. I'm using `FOR XML PATH`. Works like a charm :-) – Evaldas Buinauskas Jan 03 '17 at 16:20
  • 1
    @EvaldasBuinauskas Oh right, OK. The complex things are actually coming from Oracle in my case, but it looks easy enough to convert to XML too. Thanks for the insight! – dwjv Jan 03 '17 at 16:35
  • I have some questions: is the Logstash jdbc plugin different from the jdbc importer? Is Logstash really more efficient than writing my own batch based on SQL Server Service Broker with a queue system, as suggested in https://www.elastic.co/blog/found-keeping-elasticsearch-in-sync or by @Val? – lambdaDev Jan 03 '17 at 22:51
  • `GROUP_CONCAT` also does some magic when you need to import arrays. See [here](http://stackoverflow.com/questions/34690725/changing-the-input-data-in-logstash-using-a-filter/34691171#34691171) and [here](http://stackoverflow.com/questions/36915428/how-to-setup-elasticsearch-index-structure-with-multiple-entity-bindings/36982705#36982705) – Val Jan 04 '17 at 04:51
  • @user354075 These two are different tools created by different people, but they serve the same purpose. I'm going for Logstash just because it's maintained by the Elastic team and integrates well with the stack. – Evaldas Buinauskas Jan 04 '17 at 06:57
  • And when I read that tutorial, I get the feeling that Logstash can detect by itself whether a row has been added or updated in the data source and handle the transformation on its own. Is that the case? – lambdaDev Jan 04 '17 at 08:57
  • @lambdaDev no, not really. You could make use of the `schedule` option in the jdbc input and run it as a cron job, which would pull the data. – Evaldas Buinauskas Jan 04 '17 at 10:20
  • So does the Logstash jdbc plugin delete all documents before importing, or is it able to update existing docs? – lambdaDev Jan 04 '17 at 15:32
  • It doesn't delete documents by default. Look at the available actions: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-action Usually, if you want to insert/update documents, just use the `document_id` option: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-document_id Logstash will then create the document with that id if it doesn't exist and update it if that id already exists in the index/type. – Evaldas Buinauskas Jan 04 '17 at 15:56
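
To make the array/nested-object discussion in the comments concrete, here is a hedged filter sketch of the approach Evaldas describes: the SQL statement aggregates child rows into a single column and Logstash splits or parses it. The column names `tags` and `order_lines_xml` are illustrative assumptions only.

```
filter {
  # Assumes the SQL statement returned the child values as one
  # comma-separated column aliased as "tags"
  # (e.g. built with STRING_AGG or FOR XML PATH in SQL Server)
  mutate {
    split => { "tags" => "," }
  }

  # Assumes a nested structure was serialized to XML in a column
  # named order_lines_xml (e.g. via FOR XML PATH in a stored procedure)
  xml {
    source => "order_lines_xml"
    target => "order_lines"
    force_array => false
  }
}
```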