
I want to keep two dependent databases (Elasticsearch and Cassandra) in sync with my parent database, Postgres. I am trying to implement the method described in this article: https://qafoo.com/blog/086_how_to_synchronize_a_database_with_elastic_search.html. So I came up with two methods:

  1. Sync before updating/inserting data into dependent databases

     router.put('/account/edit', function(req, res) {
       // replay anything the dependents missed before applying this write
       syncElasticWithDatabase().then(() => {
         elasticClient.update({...});    // client for Elasticsearch
         cassandraClient.execute({...}); // client for Cassandra
         res.end();                      // note: res.end(), not req.end()
       });
     });
    

syncElasticWithDatabase() uses the data in the updates table (in Postgres). This method can be slow, since some requests have to wait for syncElasticWithDatabase() to finish. I like it because it leverages sequential ids (see the article for details): the data is synced before new data comes in, so the dependents catch up and only the missed rows get synced, avoiding the reindex/reinsert that option 2 below requires.
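The catch-up step can be sketched like this. This is a minimal, in-memory illustration of the idea, not the article's implementation: `updatesTable` stands in for the Postgres "updates" table, and `syncMissedUpdates` is a hypothetical helper that replays every row with an id above the dependents' last stored sequence id.

```javascript
// Rows written to the "updates" table, each with a monotonically
// increasing sequence id (Postgres SERIAL / sequence).
const updatesTable = [
  { id: 1, doc: { account: 'a', email: 'a@x.com' } },
  { id: 2, doc: { account: 'b', email: 'b@x.com' } },
  { id: 3, doc: { account: 'c', email: 'c@x.com' } },
];

// Replays every row the dependents have not seen yet (id > lastSeq)
// and returns the new high-water mark.
function syncMissedUpdates(updates, lastSeq, applyToDependents) {
  const missed = updates.filter((row) => row.id > lastSeq);
  for (const row of missed) applyToDependents(row);
  return missed.length ? missed[missed.length - 1].id : lastSeq;
}

// Example: the dependents last saw sequence id 1, so rows 2 and 3
// are replayed before the new write is applied.
const applied = [];
const newSeq = syncMissedUpdates(updatesTable, 1, (row) => applied.push(row.id));
console.log(applied, newSeq); // [ 2, 3 ] 3
```

Because ids only grow, re-running this after a crash is safe: rows already applied are simply replayed (idempotent upserts), and the high-water mark moves forward.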

  2. Using a background process (e.g. running every 24 hours), I could sync data by selecting the "missed out" rows from an update_error table, which is populated whenever the Elasticsearch or Cassandra write fails. Here's a rough example:

     router.put('/account/edit', function(req, res) {
       psqlClient.query('UPDATE ...').then(() => {
         elasticClient.update({...});    // client for Elasticsearch
         cassandraClient.execute({...}); // client for Cassandra
       }).catch(err => {
         // record the failed write for the background job to retry
         psqlClient.query('INSERT INTO update_error ...');
       });
     });
    

    However, this method would require reindexing or reinserting data, because in some cases Elasticsearch could insert the data while Cassandra didn't, or vice versa. Because of this I would need a separate column recording which database type failed, so that for each database (Elasticsearch or Cassandra) I can select the rows that failed since its last synchronization time.
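The per-database bookkeeping described above could look like the sketch below. Everything here is illustrative: `updateErrors` stands in for the update_error table, and the two clients are in-memory fakes, not the real Elasticsearch/Cassandra drivers. The point is that each dependent write is attempted independently and a failure records *which* database missed the document.

```javascript
const updateErrors = []; // stands in for the update_error table

async function writeToDependents(doc, clients) {
  for (const [dbType, client] of Object.entries(clients)) {
    try {
      await client.write(doc);
    } catch (err) {
      // Record which database failed, not just that "something" failed,
      // so each store can be caught up separately by the background job.
      updateErrors.push({ dbType, docId: doc.id, error: err.message });
    }
  }
}

// Fake clients: Elasticsearch succeeds, Cassandra times out.
const clients = {
  elasticsearch: { write: async () => {} },
  cassandra: { write: async () => { throw new Error('timeout'); } },
};

writeToDependents({ id: 42 }, clients).then(() => {
  console.log(updateErrors); // only the cassandra failure is recorded
});
```

The 24-hour job can then `SELECT ... WHERE db_type = 'cassandra'` (and likewise for Elasticsearch) and retry only the rows each store actually missed.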

Questions:

  1. Method 1 seems perfect, but it means some people have to wait longer than others to update their account, because of syncElasticWithDatabase(). However, the article above does exactly the same thing (look at their diagram), or am I misunderstanding something?

  2. Because of the delay described above (if I'm right), I introduced option 2. However, it feels like too much machinery just to sync, even though I've spent a good amount of time thinking about it. So: are there easier or better methods than 1 and 2?

  3. Would Apache ZooKeeper help in my case?

Thanks :)


Other references:

Sync elasticsearch on connection with database - nodeJS

https://gocardless.com/blog/syncing-postgres-to-elasticsearch-lessons-learned/

1 Answer

Basically, you'll need to use the method described here: https://qafoo.com/blog/086_how_to_synchronize_a_database_with_elastic_search.html — insert into and select from a single "updates" table. But make sure you limit the number of rows returned when selecting from "updates", e.g. LIMIT 100.

Here's the workflow:

  1. Save data to the "updates" table during every insert/update (for deletes, mark the row as deleted in a column rather than removing it).
  2. Then run this process:

    • select your last inserted sequence_id from Elasticsearch or Cassandra
    • use it to select the missed rows from the "updates" table like so: id > :sequence_id
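The two steps above can be sketched as follows, with an in-memory array standing in for the "updates" table. `selectBatch` is a hypothetical helper that mimics `SELECT * FROM updates WHERE id > :sequence_id ORDER BY id LIMIT :n`.

```javascript
// Returns the next batch of missed rows, oldest first, capped at `limit`
// (the LIMIT 100 advice above keeps each pass small and cheap).
function selectBatch(updates, sequenceId, limit = 100) {
  return updates
    .filter((row) => row.id > sequenceId)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit);
}

const updates = [{ id: 5 }, { id: 9 }, { id: 7 }, { id: 3 }];
// Dependents last stored sequence_id 3, so ids 5 and 7 come back first.
console.log(selectBatch(updates, 3, 2).map((r) => r.id)); // [ 5, 7 ]
```

After applying a batch to Elasticsearch/Cassandra, store the highest id you processed as the new sequence_id and repeat until the batch comes back empty.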

You can then insert the data into Elasticsearch or Cassandra (or do whatever you need). Make sure you insert into the "updates" table before writing to the dependent databases. There is no need to keep duplicate document_ids; just replace the existing document with the new version. This gives you consistency and lets you choose between running a cron job or syncing during a specific action, all with the same mechanism. Finally, update your stored sequence_id to the last one you processed.

I chose to sync straight after each insert/update/delete lands in "updates": I call res.end() (or whatever finishes the response) and then a sync() function that selects the next 100 records in ascending order. I also run a cron job every 24 hours (without LIMIT 100) to make sure any data that was left out gets synced. Oh, and if the updates succeeded for all databases, you might as well delete those records from "updates", unless you use the table for something else. But note that Elasticsearch can lose data that is only held in memory.
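That scheduling can be sketched like this. A minimal, hedged illustration: `sync` is a hypothetical function wrapping the select-and-replay logic above, and `setInterval` stands in for a real cron runner.

```javascript
// Wraps a sync function with the two triggers described above:
// a small catch-up right after each request, plus a periodic full pass.
function makeScheduler(sync) {
  return {
    // Call after res.end(): catch up on the next 100 rows without
    // making the client wait for the replay.
    afterRequest() { return sync(100); },
    // The 24-hour safety net: no LIMIT, replay everything missed.
    startDailyJob(intervalMs = 24 * 60 * 60 * 1000) {
      return setInterval(() => sync(Infinity), intervalMs);
    },
  };
}

// Example with a fake sync() that just records the limit it was given.
const calls = [];
const scheduler = makeScheduler((limit) => calls.push(limit));
scheduler.afterRequest();
console.log(calls); // [ 100 ]
```

The per-request call keeps lag low in the common case; the daily job exists only for rows the fast path missed (crashes, restarts, transient failures).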

Good luck :) And I am open to suggestions.
