Message storage duplication for messaging systems

Question

In many sub-system designs for messaging applications (Twitter, Facebook, etc.) I notice duplication of where user message history is stored. On the one hand, they use a tokenizing indexer like Elasticsearch or Solr, which is good for search. On the other hand, they still use some sort of DB for history. Why duplicate? Why can't the same instance of ES/Solr/Earlybird be used for history? It is in fact able to do that.

Mysterion · Answer 1 · 2019-01-13T11:59:26.630

The usual problem is the following: you want to search, and ideally you also want to be able to index the data in a different manner (e.g. wipe the index and try a new awesome analyzer that you forgot to include initially). Separating the data source and the index from each other makes the system less coupled. You no longer have to be afraid of losing data that lives only in Elasticsearch/Solr.

I am usually strongly against calling Elasticsearch/Solr a database, since in fact it's not. For example, neither of them supports transactions, which makes your life harder if you want to update multiple documents following standard relational logic.

Last but not least, one of the hardest operations in Elasticsearch/Solr is retrieving stored values, since it is not well optimised for that, especially if you want to return 10k documents at once. In this case a separate datasource also helps: you can return only the matched document ids from Elasticsearch/Solr, then retrieve the needed content from the datasource and return it to the user.
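As a rough illustration of that two-step lookup, here is a minimal in-memory sketch. Everything in it is hypothetical: `SearchIndex` stands in for Elasticsearch/Solr, `MESSAGE_STORE` stands in for the separate datasource, and neither mirrors a real client API.

```python
from collections import defaultdict

# Hypothetical stand-in for the datasource (e.g. a DB table keyed by message id).
MESSAGE_STORE = {
    1: "lunch at noon?",
    2: "noon works for me",
    3: "see you tomorrow",
}


class SearchIndex:
    """Toy inverted index: maps each token to the ids of messages containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Very naive analysis: lowercase, split on whitespace, trim punctuation.
        for token in text.lower().split():
            self._postings[token.strip("?!.,")].add(doc_id)

    def search_ids(self, term):
        # Returns only matching ids, never the stored message bodies.
        return sorted(self._postings.get(term.lower(), set()))


def search_messages(index, store, term):
    # Step 1: ask the search engine for ids only.
    ids = index.search_ids(term)
    # Step 2: fetch the actual content from the datasource.
    return [store[i] for i in ids if i in store]


index = SearchIndex()
for doc_id, text in MESSAGE_STORE.items():
    index.add(doc_id, text)

results = search_messages(index, MESSAGE_STORE, "noon")
```

In this sketch the index never has to hold or return message bodies, so it can stay small and be rebuilt without touching the stored history.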

The summary is simple: Elasticsearch/Solr should be thought of as search engines, not as data storage.

Thanks for the good, useful answer! I expect the ES/Solr authors to take all these factors (very major ones) seriously, unless they intend to stay in the non-DB area, because it is indeed duplication of massive amounts of data. Regarding transactions for raw texts, BTW, I feel one needs to reconsider the design if one needs this, but who knows... — user1439579, Jan 13 '19 at 11:57
@user1439579 to be fair, I'm not sure they should. It's perfectly fine that Elasticsearch/Solr have their niche and are not trying to be everything — Mysterion, Jan 13 '19 at 11:59
but from customer perspective they should double the resources — user1439579, Jan 13 '19 at 12:18

Val · Accepted Answer · 2019-01-13T17:39:24.943

True that ES is NOT a database per se and will never be. But no one says you cannot use it as such, and many people actually do. It really depends on your specific use case(s), and in the end it's all a question of the trade-offs you are ready to make to support your specific needs. As with pretty much any technology in general, there is no one-size-fits-all approach and with ES (and the like) it's no different.

A primary source of truth does not necessarily have to be a relational DBMS, and people are not necessarily "duplicating" the data in the sense that you meant. It can be anything that has a copy of your data and allows you to rebuild your ES indexes in case something goes wrong. I've seen many, many different "sources of truth". It could simply be:

your raw flat files containing your historical logs or business data
Kafka topics that you can replay anytime easily
a snapshot that you take from ES on a regular basis
a relational DB
you name it...

The point is that if something goes wrong for any reason (and that happens), you want to be able to recreate your ES indexes, be it from a real DB, from backups or from raw data. You should see that as a safety net. Even if all you have is a MySQL DB, you usually have a backup of it, so you're already "duplicating" the data in some way.
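A minimal sketch of that safety net, under stated assumptions: `SOURCE_OF_TRUTH` is a hypothetical list standing in for flat files, a replayed Kafka topic, a snapshot or DB rows, and the index is a plain dict rather than a real ES index. The point it tries to show is only that the index is derived data and can be thrown away and rebuilt, for instance with a different analyzer.

```python
# Hypothetical source of truth: (id, text) records from files, a topic replay, a DB...
SOURCE_OF_TRUTH = [
    (1, "Deploy FAILED on node-7"),
    (2, "deploy succeeded"),
]


def whitespace_analyzer(text):
    # Case-sensitive tokens: "Deploy" and "deploy" are different terms.
    return text.split()


def lowercase_analyzer(text):
    # Case-insensitive tokens: both records match "deploy".
    return text.lower().split()


def build_index(records, analyzer):
    """Rebuild a toy inverted index from scratch; never reads the old index."""
    index = {}
    for doc_id, text in records:
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index


# First attempt, with the analyzer chosen initially.
index_v1 = build_index(SOURCE_OF_TRUTH, whitespace_analyzer)

# The index is lost or the analysis turns out to be wrong: drop it and rebuild.
index_v2 = build_index(SOURCE_OF_TRUTH, lowercase_analyzer)
```

Because `build_index` only reads from the source of truth, losing `index_v1` costs a rebuild, not the data itself.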

One thing that you need to think of, though, when architecting your system, is that you might not necessarily need to have the entirety of your data in ES. Since ES is a search and analytics engine, you should only store in there what is necessary to support your search and analytics needs, and be able to recreate that information anytime. In the end, ES is just a subsystem of your whole architecture, just like your DB, your messaging queue or your web server.

Also worth reading: Using ElasticSeach as primary source for part of my DB

Thank you, your analysis is just excellent! One more point: you mentioned replaying Kafka topics. Do you believe storing the full history in Kafka is good practice in such scenarios? — user1439579, Jan 14 '19 at 18:29
