
Every time I Google this, I find the "river" approach, which is deprecated. I'm using Dapper, if that's helpful information.

So what's the solution for this these days?

tedi
  • You don't *sync* to Elasticsearch. You *extract* the information you want to search from your source and send it to Elastic. That's what river did. This has been replaced by Logstash as explained in [https://www.elastic.co/blog/deprecating-rivers](https://www.elastic.co/blog/deprecating-rivers). Another option is to use the [.NET clients](https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/index.html) to send data to Elastic directly – Panagiotis Kanavos Aug 18 '17 at 12:28
  • BTW that's how all search engines work, although the wiring doesn't show so much. Crawlers pull data from sources, indexers process it to generate search indexes and query processors use the indexes to speed up searches. The roles may run on multiple machines for load balancing and scalability – Panagiotis Kanavos Aug 18 '17 at 12:31
  • Thanks @PanagiotisKanavos that helped: "You don't sync to Elasticsearch. You extract the information you want to search from your source and send it to Elastic.". I just didn't know what the standard practice to do this is. – tedi Aug 18 '17 at 12:43
  • @PanagiotisKanavos so, i guess that i just add an attribute to my sql table, something like a boolean `synced`. And then just pull out all "non-synced" ones and add them to elasticsearch. Would that be a good approach? – tedi Aug 18 '17 at 12:45
  • @JedatKinports Using a boolean value in the database is certainly a valid approach. I have also leveraged an audit/timestamp field in the database table. Then the sync process would store the last time it ran and compare to the audit field and sync anything that has been added or updated since the last sync time. This approach tends to be less intrusive on the application/process that is maintaining the data in the database, as most tables have audit fields already that can be leveraged. – Paige Cook Aug 18 '17 at 12:58
  • No, you'd probably get it wrong. For example, SQL Server has provided ultra-light change tracking since 2005 that doesn't require changes to the tables and is a LOT faster, safer and more accurate than any triggers or audit columns. Using audit columns is the wrong idea here – Panagiotis Kanavos Aug 18 '17 at 13:29
  • @PanagiotisKanavos could you write your comments as an answer? So that i can award you the bounty? How can i access/use the "change tracking" feature? – tedi Aug 21 '17 at 07:11
  • @PanagiotisKanavos never mind about the "sql server change tracking" feature. I found it: https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/track-data-changes-sql-server – tedi Aug 21 '17 at 08:39
  • @JedatKinports Please elaborate your question. 1. How much data do you have in your MSSQL database? 2. Do you need to maintain synchronous copies up-to-date continually, or just synchronize them periodically? 3. What's your system configuration (number of servers, operating systems)? 4. Any limitations you may have (e.g. operating system, timing, etc.)? – Vladimir Kroz Aug 23 '17 at 21:41

5 Answers


Your question is on the broad side - so this is a pointer to some options.

Elasticsearch is used to query and analyse the data.

In the article Deprecating Rivers:

Client libraries

For more than a year, we've had official client libraries for Elasticsearch in most programming languages. It means that hooking into your application and getting data through an existing codebase should be relatively simple. This technique also allows to easily munge the data before it gets to Elasticsearch. A common example is an application that already used an ORM to map the domain model to a database, and hooking and indexing the domain model back to Elasticsearch tends to be simple to implement.

There's extensive documentation on how to use Elasticsearch in:

Elasticsearch.Net.

The docs will address the following:

Install the package:

PM> Install-Package Elasticsearch.Net

Connection

var node = new Uri("http://mynode.example.com:8082/apiKey");  
var config = new ConnectionConfiguration(node);  
var client = new ElasticsearchClient(config);

Security

Pooling and Failover

Building requests

This is what you'll need to develop.

Response handling

Error handling

Plugins

Logstash can also be used instead of rivers; various plugins have been developed for it.

Also, Logstash, or similar tools, can be used to ship data into Elasticsearch. For example, some of the rivers Elasticsearch came with are now implemented as Logstash plugins (like the CouchDB one) in the forthcoming Logstash 1.5.

Extra reading

Although this is a different language and framework, the blog post Advanced Search for Your Legacy Application by David Pilato may be helpful to browse. He recommends doing the indexing in the application layer.

To address issues raised in the comments:

Data changes can be tracked.

SQL Server provides an inbuilt system to track data changes, an effective means of automatically tracking changes to the data without having to implement manual methods to check for changes.

There are two ways to achieve this:

Using Change Data Capture:

Data changes are tracked with timestamps, and the history of the changes is preserved.

Change data capture provides historical change information for a user table by capturing both the fact that DML changes were made and the actual data that was changed. Changes are captured by using an asynchronous process that reads the transaction log and has a low impact on the system.
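As a rough sketch (the table and column names here are hypothetical, not from the question), enabling CDC and reading the captured changes looks like this:

```sql
-- Enable CDC on the database, then on a hypothetical dbo.Products table
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Products',
    @role_name     = NULL;

-- A sync job can then read all changes between two log sequence numbers
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Products');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Products(@from_lsn, @to_lsn, N'all');
```

The rows returned include the operation type (insert/update/delete), so the sync job can translate them into index, update or delete requests against Elasticsearch.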

Using Change Tracking:

This has less overhead, but does not keep track of historical changes; only the latest change per row is kept, nothing further back.

Change tracking captures the fact that rows in a table were changed, but does not capture the data that was changed. This enables applications to determine the rows that have changed with the latest row data being obtained directly from the user tables. Therefore, change tracking is more limited in the historical questions it can answer compared to change data capture. .../...
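A minimal sketch of change tracking (assuming a hypothetical dbo.Products table with primary key Id):

```sql
-- Enable change tracking on the database and the table
ALTER DATABASE MyDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Products ENABLE CHANGE_TRACKING;

-- On each sync run, fetch rows changed since the version saved last time
DECLARE @last_sync_version bigint = 0;  -- persisted by the sync job

SELECT ct.Id, ct.SYS_CHANGE_OPERATION, p.*
FROM CHANGETABLE(CHANGES dbo.Products, @last_sync_version) AS ct
LEFT JOIN dbo.Products AS p ON p.Id = ct.Id;  -- NULL for deleted rows

-- Save this value as the starting point for the next run
SELECT CHANGE_TRACKING_CURRENT_VERSION();
```

The CHANGETABLE rows tell you which primary keys changed and how; the join pulls the current row data to send to Elasticsearch.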


You can use Logstash to do the job. Simply use the Logstash JDBC input plugin to set up a Logstash pipeline. Follow this link: Migrate MySQL data to ElasticSearch

Also check out this repo on GitHub: ElasticSearchCRUD

Sharthak Ghosh
  • For SQL server check this out https://shariefmohatad.wordpress.com/2016/03/29/ms-sql-config-with-logstash/ It can be done in a similar way using JDBC – Sharthak Ghosh Aug 24 '17 at 02:01

So, just my 2¢ on implementation of this. In the past I've done this by setting a trigger to write to a buffer table that acted as an event-log. Then I had a serverless function (AWS Lambda) on a timer that would come clear out that event-log and push the needed changes into ES. This way I didn't have to do anything crazy in the trigger or really even change my original code.
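Roughly, the trigger part can be sketched like this (all names here are hypothetical):

```sql
-- Buffer table acting as an event-log
CREATE TABLE dbo.ProductEvents (
    EventId    bigint IDENTITY PRIMARY KEY,
    ProductId  int       NOT NULL,
    Operation  char(1)   NOT NULL,  -- 'I', 'U' or 'D'
    OccurredAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER trg_Products_Log
ON dbo.Products
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Inserts and updates: rows present in the inserted pseudo-table
    INSERT INTO dbo.ProductEvents (ProductId, Operation)
    SELECT i.Id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted AS i;

    -- Deletes: rows present only in the deleted pseudo-table
    INSERT INTO dbo.ProductEvents (ProductId, Operation)
    SELECT d.Id, 'D'
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;
```

The scheduled function then reads dbo.ProductEvents, pushes the corresponding documents to Elasticsearch, and deletes the rows it has processed.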

John Jones

I've come across this post multiple times and feel it needs an updated answer.

For shipping data from an MSSQL instance into Elasticsearch I use Logstash, which is part of the ELK stack. You define individual pipelines and configurations using the JDBC input plugin.

Here is an example config file. It runs a stored procedure every 2 minutes and inserts the data into the correct index. Keep in mind that you need some way to sync only new records, otherwise you'll have a scaling issue as the data grows.

input {
    jdbc {
        jdbc_connection_string => "jdbc:sqlserver://${sql_server};database=api;user=<username>;password=<password>;applicationname=logstash"
        # The user we want to execute our statement as
        jdbc_user => nil
        id => "index_name"
        jdbc_driver_library => "/var/tmp/mssql-jdbc-6.2.2.jre8.jar"
        jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
        schedule => "*/2 * * * *"
        statement => "exec stored_procedure"
        lowercase_column_names => false
    }
}

output {
    elasticsearch {
        id => "index_name"
        hosts => "elasticsearch:9200"
        index => "index_name"
        document_id => "%{primary_key}"
    }
}

Jeff Beagley
  • The best solution I had found to only sync up new records is to rely upon your own offsets. So I created a table that per each pipeline kept the last time it had ran, so each time the stored procedure would run, it'd grab the current offset and get records only newer. – Jeff Beagley Jan 09 '20 at 17:08
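The offset table described in the comment above could look something like this (table and column names are hypothetical):

```sql
-- One row per pipeline, holding the timestamp of its last successful run
CREATE TABLE dbo.SyncOffsets (
    Pipeline  varchar(100) PRIMARY KEY,
    LastRunAt datetime2    NOT NULL
);

-- Inside the stored procedure that the pipeline executes:
DECLARE @since datetime2 =
    (SELECT LastRunAt FROM dbo.SyncOffsets WHERE Pipeline = 'index_name');

SELECT * FROM dbo.Products WHERE ModifiedAt > @since;

UPDATE dbo.SyncOffsets
SET LastRunAt = SYSUTCDATETIME()
WHERE Pipeline = 'index_name';
```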

Even though the question asks about syncing from MSSQL --> ElasticSearch, I feel that the basic idea of syncing across heterogeneous systems is much the same. You may need to:

  • Define and create batches of data to be synced
  • Track the last batch synced in order to determine where to begin (basically, markers)
  • Transform the data
  • Finally, transport a batch
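In SQL Server terms, the marker-based batching above might be sketched as follows (the table and column names are hypothetical):

```sql
-- Pull the next batch of rows after the saved marker
DECLARE @marker datetime2 =
    (SELECT LastMarker FROM dbo.SyncMarkers WHERE Target = 'elasticsearch');

SELECT TOP (500) Id, Name, ModifiedAt
FROM dbo.Products
WHERE ModifiedAt > @marker
ORDER BY ModifiedAt;  -- deterministic order: the new marker is the max ModifiedAt seen
```

After transforming and transporting the batch, store the largest ModifiedAt back into dbo.SyncMarkers so the next run resumes from there.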

The article Continuous data sync across Heterogeneous Systems - YoursAndMyIdeas explains how to achieve this in more detail.

Sunil Singhal