1

What is the best way to know when an Elastic Search index needs updating, assuming the upstream data source is transactional SQL Server tables with inserts, updates and deletes?

Examples: Tables Parent, Child, Grandchild.

Parent      |  Child                | Grandchild
ID   Name   |  ID  ParentID  Name   | ID  ChildID Amount
1    Foo    |  10   1         Bike  | 100 10      5
2    Bar    |  20   1         Car   | 200 20      2
3    Baz    |  30   3         Tran  | 300 30      1

Grandchild is updated, and Elastic Search index on Parent needs to be updated for the associated record.

So, on Grandchild updated, I need to find the Parent.ID for that Grandchild. Which means joining to Child and getting the ParentID value.

At the same time, we are starting a incremental, iterative loaded data warehouse initiative, so ideally I would like to use the same SQL Server API/technique for both.

Based on comments in How to notify a windows service(c#) of a DB Table Change(sql 2005)? by Remus Rusanu, Query Notification API should not be used, as its only intended use is cache invalidation, not change tracking...

Which seems to leave two options - SQL Server Change Data Capture, and SQL Server Change Tracking API.

We thought about doing all change tracking in the application level, but our main concerns are out-of-band updates due to the fact some data needs to be updated over night in unforeseen ways due to new government regulations, and so we really need a way to capture changes at the table-level and bubble it up into a queue to feed Elastic Search.

Thank you!

Community
  • 1
  • 1
John Zabroski
  • 2,212
  • 2
  • 28
  • 54

2 Answers2

2

The appropriate API for this is either change tracking or change data capture. Which one depends on the frequency/volume of data changes and on how big latency you can afford between the original data and the search index. For low latency and frequent changes CDC is better imho, as it can give you a 'delta' with minimal cost. For slow changing data and an infrequent Elastic Search index refresh maybe I would prefer CT as is more lightweight, although figuring out the 'delta' is more complex (I say maybe because in general I found CDC better suited than CT over long term solutions, as requirements evolve CDC ends up as a better fit).

The usual problem with tracking changes is finding out what was deleted. In house solutions, based on triggers or implemented in app layer, always have problems with that part. Is not impossible to do it, but you'll end up re-implementing CT/CDC on your own, without access to the internals of SQL log parsing and extra update logging that CDC leverages...

Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
  • Can you define 'frequent' in terms of # of updates? Only some of our customers use Enterprise Edition. I suppose I could create a 'split' configuration for customers that use EE and would benefit from it, but it seems like a bunch of busy work. I can see 2 million updates a month to various indexes. – John Zabroski Dec 02 '15 at 16:29
  • 1
    2 million updates a month is ~1/sec. Polling for changes every ~1min would yield ~50-60 changes that need to be propagated. I would use CDC for this, not CT. EE Licensing constraints is a valid point. – Remus Rusanu Dec 02 '15 at 21:08
  • Change Data Capture is now available in Std Edition since SQL Server 2016 (13.x) SP1 https://learn.microsoft.com/en-us/sql/sql-server/editions-and-components-of-sql-server-2016 - also, to follow up, benchmarking Change Tracking + Stitching CT objects from 18 tables together resulted in ~300ms delay. I am no longer working for the same company but thought if anyone else reads this comment, might appreciate the guidance on performance – John Zabroski Aug 15 '18 at 14:04
2

This guy as in interesting solution using triggers, the built-in ServiceBroker to queue the changes and a C# service to read that queue and push changes to elastic search: https://medium.com/@mindingdata/elasticsearch-realtime-rivers-with-mssql-server-e1540a9bf1d3#.72k9buet5

the architecture is like a CDC but using service broker to store the changes instead of the CDC tables

Hector Salazar
  • 141
  • 1
  • 6
  • Thanks. My friend came up with another hack, which is to use " RAISEEror with no wait" but it's not guaranteed to reliably deliver messages. – John Zabroski Nov 25 '18 at 01:06