
In our project, we are hitting Elasticsearch's index refresh API after each create/update/delete operation for immediate search availability.

I want to know how Elasticsearch will perform if multiple parallel requests are made to its refresh API on a single index holding close to 2.5 million documents.

Any thoughts or suggestions?

ms_27
  • Could you expand on what you want to achieve with that? For example, you may hit one node with a refresh index query, but you're not even sure the replicas have ingested the data yet, so you may not even achieve what you think you would. Elastic is already near real time, and you can set up auto refresh rates (the default `index.refresh_interval` is 1 second). So unless you *know* a use case where your app would insert and then *immediately* fetch back you may not even need to actually refresh. – GPI Sep 28 '18 at 07:42
  • Yes there is such usecase. So, our application has a UI flow wherein the user creates a content and then immediately navigates to a listing screen where most recently created/modified contents are displayed on top. – ms_27 Sep 28 '18 at 08:14
  • Currently, after every CUD operation we hit refresh api. And, we are working on a feature where existing document count will increase two/three fold i.e. upto 7.5 million. I am wondering if we still continue to refresh after every CUD operation, then it will impact the elastic search performance in any way. By performance I mean, any delay in indexing, searching or more GC pauses, etc. – ms_27 Sep 28 '18 at 08:18
  • This answer may also help: https://stackoverflow.com/questions/31499575/how-to-deal-with-elasticsearch-index-delay/34391272#34391272 – Val Sep 28 '18 at 10:46

1 Answer


Refresh is an operation where Elasticsearch asks the Lucene shard to write its in-memory buffer out as a new segment, so that recent changes become searchable. If you ask for a refresh after every operation, you will create a huge number of micro-segments.

Too many segments make your searches slower, because the shard needs to search through all of them sequentially in order to return a result. They also consume hardware resources.

> Each segment consumes file handles, memory, and CPU cycles. More important, every search request has to check every segment in turn; the more segments there are, the slower the search will be.

— from the Definitive Guide

Lucene will merge those segments into bigger segments automatically, but that's also an I/O-consuming task.
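To make the segment-churn argument concrete, here is a back-of-envelope sketch with made-up numbers (the write rate and window are arbitrary assumptions, not measurements): it compares an upper bound on new segments created when refreshing after every operation versus the default 1 second `index.refresh_interval`.

```python
def segments_created(writes_per_second, refresh_interval_s, window_s=3600):
    """Rough upper bound on new segments per shard over a time window.

    With per-operation refreshes, every write can produce its own
    micro-segment; with a periodic interval, at most one segment is
    created per interval, no matter how many writes it batches together.
    """
    if refresh_interval_s == 0:  # refresh forced after every operation
        return writes_per_second * window_s
    return window_s // refresh_interval_s

per_op  = segments_created(writes_per_second=50, refresh_interval_s=0)
default = segments_created(writes_per_second=50, refresh_interval_s=1)

print(per_op)    # 180000 micro-segments/hour before merging catches up
print(default)   # 3600 segments/hour at the default 1 s interval
```

Merging eventually collapses both counts, but in the per-operation case the merge scheduler has to chew through orders of magnitude more tiny segments, which is exactly the extra I/O mentioned above.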

You can check this for more details.

But to my knowledge, a refresh on a 2.5 billion document index will take roughly the same time as on a 2.5k document index. Also, it seems (from this issue) that refresh is a non-blocking operation.

But it's a bad pattern for an Elasticsearch cluster. Does every CUD operation in your application really need a refresh?
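If the real requirement is only "the user must see their own write on the next screen", the `refresh` parameter on the write request itself is usually a better fit than a separate call to the refresh API: `refresh=wait_for` makes the write block until a scheduled refresh has made it visible, without forcing an extra refresh per operation. A minimal sketch of the Document Index API URL (host, index name, and document id here are hypothetical):

```python
# The three documented values of the `refresh` write parameter.
REFRESH_MODES = {
    "true":     "force an immediate refresh; visible at once, most segment churn",
    "false":    "default; visible after the next scheduled refresh_interval",
    "wait_for": "block the response until a scheduled refresh makes it visible",
}

def index_url(host, index, doc_id, refresh="wait_for"):
    """Build a Document Index API URL carrying a refresh mode."""
    if refresh not in REFRESH_MODES:
        raise ValueError(f"unknown refresh mode: {refresh}")
    return f"{host}/{index}/_doc/{doc_id}?refresh={refresh}"

print(index_url("http://localhost:9200", "contents", "42"))
# http://localhost:9200/contents/_doc/42?refresh=wait_for
```

With `wait_for`, your "create then immediately list" UI flow still sees the new document, but the cluster keeps batching writes into segments at its own pace.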

Pierre Mallet