
For example, I have 1,000 users, and each user's data is not big (at most 1 GB). So I have two strategies for indexing:

  • Big index: a single index for everyone. Every time a user searches their data, I add a user_id filter to the query.
  • Small indices: every user gets their own Elasticsearch index. Because the data is not huge, each index only needs 1-2 shards.

My opinion is that the second method is a lot faster because we don't need to add a user_id filter to the query. The first method might be slower because it has to hit many shards and, at the same time, evaluate the user_id filter.
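
To make the comparison concrete, here is a minimal sketch of what a search looks like under each strategy. The cluster URL, the index names (all_users for the big index, user_42 for a per-user index), and the field names (user_id, content) are all assumptions for illustration:

```python
# Sketch only: the URL, index names, and field names are assumptions.
import requests

ES = "http://localhost:9200"

# Strategy 1: one big index; every search carries a user_id filter.
big_index_query = {
    "query": {
        "bool": {
            "filter": [{"term": {"user_id": "42"}}],            # restrict to one user's documents
            "must": [{"match": {"content": "elasticsearch"}}],  # the user's actual search
        }
    }
}
r1 = requests.post(f"{ES}/all_users/_search", json=big_index_query)

# Strategy 2: one index per user; no user_id filter is needed.
per_user_query = {"query": {"match": {"content": "elasticsearch"}}}
r2 = requests.post(f"{ES}/user_42/_search", json=per_user_query)

print(r1.json()["hits"]["total"], r2.json()["hits"]["total"])
```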

However, there are some references (ref1, ref2) that recommend keeping the total number of shards relatively small.

In a practical environment, what is a good solution for my situation?

Trần Kim Dự

1 Answer

It's a waste of resources to create one index per user, especially if you have 1,000+ users. If your app is successful and your user base grows, so will the number of indices, and thus the number of shards. Even with one shard per index, having 1,000 shards already uses up quite a big amount of resources.

It's much more efficient to have a single index and throw all your users in it with a user_id field to discriminate each user's data.
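
As a minimal sketch of that single-index setup (the index name all_users, the shard count, and the field names are assumptions): mapping user_id as a keyword makes the per-user restriction an exact term filter, which runs in filter context and can be cached.

```python
# Sketch only: index name, field names, and shard count are assumptions.
import requests

ES = "http://localhost:9200"

# One shared index; user_id is a keyword so it can be filtered exactly.
requests.put(f"{ES}/all_users", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "user_id": {"type": "keyword"},
            "content": {"type": "text"},
        }
    },
})

# Every document carries its owner's user_id.
requests.post(f"{ES}/all_users/_doc", json={"user_id": "42", "content": "hello world"})

# Searches then add a term filter on user_id, as in the earlier sketch.
```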

Val
  • I agree that using an index per user is a little weird. Can I ask two questions? 1) What is the overhead cost of an index/shard, and why? Are there any resources where I can read up on this? 2) As I mentioned, does the big index get slower as the number of users increases, even though all we really need is a per-user search? In that case, can we easily partition the index, for example one index for all users with ID <= 10,000 and another for the rest? – Trần Kim Dự Mar 02 '20 at 12:22
  • You can definitely have more than one index and partition your users depending on their ID or whatever attribute they might have, but clearly not one index per user. Each shard is a standalone Lucene search engine that consumes resources (CPU, heap, bandwidth, etc.). – Val Mar 02 '20 at 12:24
  • About the last part, the Lucene search engine: is that the trade-off between computing resources and speed (as more shards means more concurrent computation)? – Trần Kim Dự Mar 02 '20 at 12:27
  • I don't understand your last question – Val Mar 02 '20 at 14:05
  • I mean: as I understand it, each shard is a Lucene index and Elasticsearch can compute on all shards concurrently. Does that mean the trade-off is that we spend a lot of computing resources (CPU, RAM) in exchange for performance? – Trần Kim Dự Mar 02 '20 at 14:23
  • The rule of thumb is that the more shards you add, the more parallelism you gain... but only up to a certain point, after which performance doesn't increase but decreases, and things start to get slower again because too many shards consume too many resources. – Val Mar 02 '20 at 14:28
  • My apologies for the late reply. I have researched this and have some useful information. 1) For my problem, there is a feature named "routing" that suits my requirements. 2) About having many indices: based on the Elasticsearch blog, if I move from 1 index to 2 indices, the storage will nearly double. – Trần Kim Dự Mar 10 '20 at 11:50
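
As a follow-up to the routing feature mentioned in the last comment, here is a rough sketch of how custom routing can be combined with the shared index (again assuming the hypothetical all_users index on a local cluster). Routing by user_id keeps each user's documents on a single shard, so a routed search only touches that shard instead of fanning out to all of them:

```python
# Sketch only: index name, field names, and routing value are assumptions.
import requests

ES = "http://localhost:9200"

# Index with the user's ID as the routing value so all of this user's
# documents land on the same shard.
requests.post(
    f"{ES}/all_users/_doc",
    params={"routing": "42"},
    json={"user_id": "42", "content": "some user data"},
)

# Search with the same routing value: only the shard that "42" routes to is queried.
# The user_id filter is still needed because several users can share a shard.
requests.post(
    f"{ES}/all_users/_search",
    params={"routing": "42"},
    json={
        "query": {
            "bool": {
                "filter": [{"term": {"user_id": "42"}}],
                "must": [{"match": {"content": "data"}}],
            }
        }
    },
)
```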