I want to sync two dependent databases (elasticsearch and casandra) with my parent database: postgres. I am trying to implement a method in this article: https://qafoo.com/blog/086_how_to_synchronize_a_database_with_elastic_search.html. So I came up with 2 methods
Sync before updating/inserting data into dependent databases
router.put('/account/edit', function(req, res) { syncElasticWithDatabase().then(() => { elastiClient.update({...}); // client for elasticsearch cassandraClient.execute({...}); // client for cassandra req.end(); }) })
syncElasticWithDatabase()
uses data in updates
table (from postgres), this method can be slow since some people would have to wait for syncElasticWithDatabase()
to finish. I like this method because I leverage sequantial_ids
(check article for details). The data is synced before new data comes in, allowing dependencies to catch up and only missed out data will be synced. Preventing reindex/reinsert unlike options 2 below.
Using a backround process (ei: running every 24 hours), I could sync data by selecting "missed out data" from
update_error
table, which contains data when elasticsearch or cassandra fail. Here's a rough examplerouter.put('/account/edit', function(req, res) { psqlClient.query('UPDATE....').then(() => { elastiClient.update({...}); // client for elasticsearch cassandraClient.execute({...}); // client for cassandra }).catch(err => { psqlClient.query('INERT INTO update_error ....') }) })
However this method would require to reindex or reinsert data, because in some cases elasticsearch could insert data while cassandra didn't or either way. Because of this I will need a separate column that will record database type that failed. This way I can select data that failed since the last synchronization time for each type of database (ealsticsearch or cassandra).
Questions:
Method 1 seems perfect, but this would mean some people would have to wait for longer than others to update their account due to
syncElasticWithDatabase()
. However the article above does exactly the same (look at their diagram) or I am misunderstanding something?Because of the delay described above (if I'm correct), I introduced option 2. However it's just too much in order to sync IMHO. Yet I spent a good time thinking about this... So are there easier or better methods than 1 and 2?
Would Apache Zoo Keeper help in my case?
Thanks :)
Other reference
Sync elasticsearch on connection with database - nodeJS
https://gocardless.com/blog/syncing-postgres-to-elasticsearch-lessons-learned/