How reliable is ElasticSearch as a primary datastore against factors like write loss, data availability

Question

I am working on a project that requires a generic dashboard where users can group, filter and drill down on different fields. For this we are looking for a search store that allows us to slice and dice the data.

There would be multiple sources of data, all of which would be stored in the search store. Some pre-computation on the source data may be required, which can be done by an intermediate component.

I have looked through several blogs to understand whether ES can be used reliably as a primary datastore too. It mostly depends on the use case. Some information about ours:

Around 300 million records each year, 1-2 KB each.
Assuming we store 1 year of data, we are at 300 GB today, but this can go up to 400-500 GB as the data grows.
We are not yet sure how we will push data, but roughly it can go up to ~2-3 million records per 5 minutes.
Search requests are few, but they involve complex queries that can span the last 6 weeks to 6 months of data.
Documents will be indexed on almost all of their fields.
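
As a rough sanity check on the figures above, the sketch below works out the raw yearly volume and the peak ingest rate. It assumes decimal units (1 KB = 1,000 bytes) and ignores replicas, compression and index overhead, none of which are given in the question.

```python
# Back-of-the-envelope sizing from the numbers in the question.
# Assumptions (not from the question): 1 KB = 1,000 bytes, no replicas,
# no index overhead, no compression.

RECORDS_PER_YEAR = 300_000_000
RECORD_SIZE_KB = (1, 2)                           # stated size range per record
PEAK_RECORDS_PER_5_MIN = (2_000_000, 3_000_000)   # stated peak batch size


def yearly_volume_gb(records: int, size_kb: float) -> float:
    """Raw data volume in GB for one year of records."""
    return records * size_kb * 1_000 / 1_000_000_000


def records_per_second(records_per_window: int, window_seconds: int = 300) -> float:
    """Average ingest rate needed to absorb one 5-minute batch."""
    return records_per_window / window_seconds


low_gb, high_gb = (yearly_volume_gb(RECORDS_PER_YEAR, kb) for kb in RECORD_SIZE_KB)
low_rate, high_rate = (records_per_second(n) for n in PEAK_RECORDS_PER_5_MIN)

print(f"raw volume per year: {low_gb:.0f}-{high_gb:.0f} GB")
print(f"peak ingest rate: {low_rate:,.0f}-{high_rate:,.0f} records/s")
```

With 1 KB records this reproduces the 300 GB quoted above; 2 KB records would give 600 GB, so the 400-500 GB estimate implies an average record size below 2 KB. The peak rate works out to roughly 6,700-10,000 records per second.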

Some blogs say that it is reliable enough to use as a primary data store -

And some blogs say that ES have few limitations -

Has anyone used Elasticsearch as the sole source of truth, without a primary store like PostgreSQL, DynamoDB or RDS? I have read that ES has certain issues, such as split brain and index corruption, that can lead to data loss. So I would like to know whether anyone has used ES this way and run into trouble with their data.

Thanks.

We are at the edge of a similar design decision with slightly bigger data requirements. We plan to support ES with Riak, in addition to regular snapshots and a replayable Kafka log. The decisive factors in our case are twofold: 1) growing segmentation due to a high update rate and 2) the performance impact of updates on reads. I strongly recommend that you simulate your load and run a couple of benchmarks. That being said, we (bol.com, the biggest online retailer in the Netherlands and Belgium) have been using ES 5.x in production for 2 years without a single hiccup. Good luck and keep us posted on updates. — Volkan Yazıcı, Feb 20 '18 at 20:34
So how was your experience, Harshit? It's been 3 years since this post :) — Leo Gallucci, Nov 15 '18 at 19:07

score 38 · Answer 1 · answered Jul 13 '15 at 08:47

Short answer: it depends on your use case, but you probably don't want to use it as a primary store.

Longer answer: you should understand all of the possible issues that can come up around resiliency and data loss. Elastic has some great documentation of these issues, which you should read before using it as a primary data store. In addition, Aphyr's post on the topic is a good resource.

If you understand the risks you are taking and you believe that those risks are acceptable (e.g. because small data loss is not a problem for your application) then you should feel free to go ahead and try it.

Cory

I am not sure about the performance of adding new data to Elasticsearch. Since everything needs to be indexed, every related index has to be updated. In other NoSQL stores, however, we can specify manually which indexes we need. For example, if the document is {name:"ricky", age:18}, we might only need to update the index for 'name' in a NoSQL store, but in Elasticsearch we need to update both 'name' and 'age'. This could be a potential performance issue. Please correct me if I am wrong. – Ricky Jiao Sep 27 '16 at 09:40
Here is another question that is also relevant for this topic: https://stackoverflow.com/questions/27054954/elasticsearch-vs-cassandra-vs-elasticsearch-with-cassandra – zsltg Oct 07 '17 at 09:16

score 12 · Answer 2 · answered Apr 24 '15 at 07:57

It is generally a good idea to design redundant data storage solutions. For example, a fast and reliable approach is to first push everything as flat data to static storage like S3, then have ES pull and index the data from there. If you need more flexibility, for example to use an ORM, you could put an RDS or Redshift layer in between. This way the data can always be rebuilt in ES.

How you balance redundancy against flexibility and performance depends on your needs and requirements. If there's a lot of data involved, you could store the raw data statically and index only parts of it in ES.
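
A minimal sketch of that idea, assuming a local directory stands in for the static store (S3 in this answer) and a plain dict stands in for the ES index. `archive_record`, `index_record` and `rebuild_index` are hypothetical helper names, not part of any AWS or Elasticsearch client. The point is the ordering: each record is written to the archive before it is indexed, so the index can be discarded and rebuilt.

```python
import json
import tempfile
from pathlib import Path

# Stand-ins (assumptions for illustration only):
#   archive_dir -> the static store, e.g. an S3 bucket
#   index       -> the search index, e.g. an ES index


def archive_record(archive_dir: Path, record: dict) -> Path:
    """Write the raw record to the archive first; this is the source of truth."""
    path = archive_dir / f"{record['id']}.json"
    path.write_text(json.dumps(record))
    return path


def index_record(index: dict, record: dict, fields: tuple) -> None:
    """Index only the chosen fields; the full record stays in the archive."""
    index[record["id"]] = {f: record[f] for f in fields if f in record}


def rebuild_index(archive_dir: Path, fields: tuple) -> dict:
    """Recreate the index from scratch by replaying every archived record."""
    index = {}
    for path in sorted(archive_dir.glob("*.json")):
        index_record(index, json.loads(path.read_text()), fields)
    return index


archive_dir = Path(tempfile.mkdtemp())
index = {}
records = [
    {"id": "1", "region": "eu", "amount": 18, "payload": "raw blob"},
    {"id": "2", "region": "us", "amount": 30, "payload": "raw blob"},
]
for record in records:
    archive_record(archive_dir, record)                # durable write first
    index_record(index, record, ("region", "amount"))  # then make it searchable

index.clear()                                          # simulate losing the index
index = rebuild_index(archive_dir, ("region", "amount"))
print(index)
```

Only `region` and `amount` end up in the index here, which mirrors the suggestion to keep the raw data in static storage and index just the parts you query.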

AWS Lambda offers great features:

Many developers store objects in Amazon S3 while using Amazon DynamoDB to store and index the object metadata and enable high speed search. AWS Lambda makes it easy to keep everything in sync by running a function to automatically update the index in Amazon DynamoDB every time objects are added or updated from Amazon S3.

score 1 · Answer 3 · answered Oct 11 '22 at 17:09

Since 2015, when this question was originally posted, many resiliency issues have been found and addressed, and in recent years many features, specifically for stability and resiliency, have been added. It is definitely something to consider, given the right use case and the right features used in the right way.

So as of 2022, my answer to this question is - yes you can, as long as you do it correctly and for the right use-case.

score -1 · Answer 4 · answered May 30 '23 at 16:40

During day-to-day conversations with customers, we often encounter people that either want to use Elasticsearch as their primary data store or that have already decided to use it that way. But this is actually something we discourage. Below, I will explain a few of the reasons why we discourage using Elasticsearch as your application’s primary data store.

It is a search engine not a database

Search engines serve a fundamentally different purpose than a database. Most databases are ACID compliant. Elasticsearch is not, which means it is inherently riskier to use it like a database. Among other idiosyncrasies, Elasticsearch offers atomicity only on a per-document basis, not on a transaction basis.

To understand the problem, let’s look at a real-world scenario: a transaction with your bank account. A customer makes a purchase and the amount is debited (removed) from their account balance and then credited (added) to the vendor’s account balance. If one of these operations fails, say, because the customer doesn’t have enough funds, then neither account should be modified. Otherwise, this could lead to the vendor being credited with money that wasn’t debited from anywhere, which would be a problem (unless you’re the lucky vendor!).

With an ACID-compliant data store, each transaction ensures that all operations succeed or fail at once, keeping the database in a consistent state. But, Elasticsearch doesn’t provide this option. It’s possible to issue a Bulk call that reduces a count for a customer record and increases a count for a vendor record, and if one fails, the other might succeed. This can really mess things up.
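
The Bulk behaviour described above shows up in the shape of the response: the Bulk API returns a top-level `errors` flag plus one result per item, and a failed item does not undo the items that succeeded. The sketch below inspects a hand-written response of that shape. The index name, ids and error text are invented for illustration, `failed_items` is a hypothetical helper, and no Elasticsearch client is involved.

```python
def failed_items(bulk_response: dict) -> list:
    """Return (action, _id, status) for every item that did not succeed."""
    failures = []
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response["items"]:
        # Each item is a one-key dict: {"update": {...}}, {"index": {...}}, ...
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result["_id"], result["status"]))
    return failures


# Hand-written example in the shape of a Bulk API response (values are invented).
sample_response = {
    "took": 7,
    "errors": True,
    "items": [
        # The debit on the customer record was rejected ...
        {"update": {"_index": "accounts", "_id": "customer-1", "status": 409,
                    "error": {"type": "version_conflict_engine_exception",
                              "reason": "version conflict"}}},
        # ... but the credit on the vendor record was still applied.
        {"update": {"_index": "accounts", "_id": "vendor-9", "status": 200,
                    "result": "updated"}},
    ],
}

failures = failed_items(sample_response)
print(failures)
```

Because there is no rollback, the application has to detect a partial failure like this one and compensate on its own, for example by retrying the failed item or reversing the one that succeeded.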

copy, paste from that link : https://bonsai.io/blog/why-elasticsearch-should-not-be-your-primary-data-store#:~:text=Elasticsearch%20focuses%20on%20making%20data,consistency%20is%20sacrificed%20for%20expediency. — Abdelrahman Elayashy, Aug 20 '23 at 23:08
