I'm building a multi-tenant Java application. Starting from a fairly small base of 100 tenants, I've begun thinking about how to scale things up. In my application each tenant has a list of products, and each tenant can import products from a HUGE main list of 1 million records.

So, if every tenant imports all products, I would have an ES index of 100 million documents. Each document has 30 fields.

Because that huge list of products is the same for all tenants, I was thinking of not replicating the data for each tenant and instead keeping a central index of 1 million products to query directly.

So, in the end I would have:

  1. One cluster for the main central product list
  2. One or more clusters for the tenants' indexes
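
The sizing behind the two layouts can be sketched with a few lines of arithmetic. The class and method names are hypothetical, and the number of overrides per tenant is an assumed input rather than a measured figure.

```java
/** Rough document-count comparison for the two layouts (illustrative only). */
final class IndexSizing {

    /** Layout A: every tenant imports a full copy of the catalog. */
    static long replicatedDocs(long tenants, long catalogSize) {
        return Math.multiplyExact(tenants, catalogSize);
    }

    /** Layout B: one shared catalog plus only the products each tenant edited or added. */
    static long sharedDocs(long tenants, long catalogSize, long overridesPerTenant) {
        return Math.addExact(catalogSize, Math.multiplyExact(tenants, overridesPerTenant));
    }
}
```

With 100 tenants and a catalog of 1,000,000 products, `replicatedDocs` gives 100,000,000 documents; with an assumed 1,000 overrides per tenant, `sharedDocs` gives 1,100,000.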

When a tenant searches for a product, a cross-cluster query (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html) would be performed to find, on the fly, all products from the main index plus the products from the tenant's index.
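
As a minimal sketch of what such a request would target: cross-cluster search addresses a remote index as `cluster_alias:index`, and several targets are joined with commas. The helper below assumes the search is sent to the tenant cluster and that the central cluster is registered there as a remote; the class name, the alias `catalog`, and the index names are hypothetical.

```java
/** Builds search targets for a combined remote + local search (illustrative only). */
final class SearchTargets {

    /** Remote index is written as alias:index, the local index has no prefix. */
    static String crossCluster(String remoteClusterAlias, String remoteIndex, String localIndex) {
        return remoteClusterAlias + ":" + remoteIndex + "," + localIndex;
    }

    /** REST path of the _search endpoint for the given target expression. */
    static String searchPath(String target) {
        return "/" + target + "/_search";
    }
}
```

For example, `SearchTargets.searchPath(SearchTargets.crossCluster("catalog", "products", "tenant_42_products"))` yields `/catalog:products,tenant_42_products/_search`.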

When a tenant edits a product from the main index, that product is also copied into the tenant's index.

So here another problem arises: I need to remove duplicates (the modified product is still the same as the one in the main index, with just some changes such as the price). How can I do that? I could use an aggregation as described here: https://stackoverflow.com/a/29886871/2012635
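
To make the duplicate rule concrete, here is a sketch of the overlay logic as a plain client-side merge. This is not the aggregation from the linked answer, only an illustration of "the tenant's copy wins". It assumes the tenant's copy keeps the same product id as the main product; the `Product` type and its fields are hypothetical, and the merge only covers the hits already fetched, so paging over combined results is not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Overlays tenant-specific copies on top of main-catalog hits (illustrative only). */
final class ProductOverlay {

    /** Minimal stand-in for a search hit; the fields are hypothetical. */
    static final class Product {
        final String id;
        final String origin; // "main" or "tenant"
        final double price;

        Product(String id, String origin, double price) {
            this.id = id;
            this.origin = origin;
            this.price = price;
        }
    }

    /** Returns one product per id; a tenant hit replaces the main hit with the same id. */
    static List<Product> merge(List<Product> mainHits, List<Product> tenantHits) {
        Map<String, Product> byId = new LinkedHashMap<>();
        for (Product p : mainHits) {
            byId.put(p.id, p);
        }
        for (Product p : tenantHits) {
            byId.put(p.id, p); // same id: the tenant copy overrides the main copy
        }
        return new ArrayList<>(byId.values());
    }
}
```

Products that exist only in the tenant's index are appended after the main hits, and overridden products keep the position of the main hit they replace.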

So finally, my questions are:

  1. Is the central ONE BIG index modeling better than having 100 BIG indexes? I should save money because I would have much much less data.
  2. Is the cross-cluster query too expensive, considering that I also have to use an aggregation to remove duplicates?
  3. Is there a better approach more suitable for my requirements?

A typical search query I have to perform looks like this:

{
  "bool" : {
    "filter" : [     
      {
        "bool" : {
          "must" : [
            {
              "bool" : {
                "must" : [
                  {
                    "range" : {
                      "sphereMin" : {
                        "from" : "-17",
                        "to" : null,
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  },
                  {
                    "range" : {
                      "sphereMax" : {
                        "from" : null,
                        "to" : "5",
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  }
                ],
                "adjust_pure_negative" : true,
                "boost" : 1.0
              }
            }
          ],
          "should" : [
            {
              "range" : {
                "sphereMin" : {
                  "from" : null,
                  "to" : "-17",
                  "include_lower" : true,
                  "include_upper" : true,
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "must" : [
            {
              "bool" : {
                "must" : [
                  {
                    "range" : {
                      "sphereMax" : {
                        "from" : "-17",
                        "to" : null,
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  },
                  {
                    "range" : {
                      "sphereMax" : {
                        "from" : null,
                        "to" : "5",
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  }
                ],
                "adjust_pure_negative" : true,
                "boost" : 1.0
              }
            }
          ],
          "should" : [
            {
              "range" : {
                "sphereMax" : {
                  "from" : "5",
                  "to" : null,
                  "include_lower" : true,
                  "include_upper" : true,
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

I also have some aggregations and a matchQuery on an edge_ngram filter.

drenda

1 Answer

100m documents is not that big, though it depends on the available resources, latency requirements, and so on. It is not clear why you need separate clusters (and cross-cluster search); what seems to fit better here is searching across multiple indexes, or aliasing. It is also not clear why the original product index needs to be included in the search query, with duplicates handled afterwards.
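
A minimal sketch of the aliasing option, assuming the main index and the tenant index live in the same cluster (an alias cannot span clusters). It only builds the request body for the `_aliases` endpoint; the class name, alias name, and index names are hypothetical, and the names are assumed to contain no characters that need JSON escaping.

```java
/** Builds an _aliases request body that groups several indexes under one alias. */
final class AliasActions {

    /** One "add" action per index, all pointing at the same alias. */
    static String addToAlias(String alias, String... indexes) {
        StringBuilder body = new StringBuilder("{\"actions\":[");
        for (int i = 0; i < indexes.length; i++) {
            if (i > 0) {
                body.append(',');
            }
            body.append("{\"add\":{\"index\":\"").append(indexes[i])
                .append("\",\"alias\":\"").append(alias).append("\"}}");
        }
        return body.append("]}").toString();
    }
}
```

Posting the result of `AliasActions.addToAlias("tenant_42_search", "products_main", "products_tenant_42")` to `/_aliases` would let the tenant search both indexes through `/tenant_42_search/_search`.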

To answer your questions:

Is the central ONE BIG index modeling better than having 100 BIG indexes? I should save money because I would have much much less data.

100 indexes give you more flexibility for scaling and querying.

Is the cross-cluster query too expensive, considering that I also have to use an aggregation to remove duplicates?

Too cheap or too expensive - it depends. If everything is filtered out and only a few documents match the query then deduplication is "cheap". But, again, better

Is there a better approach more suitable for my requirements?

If 100 tenants (or 100m docs) is not your limit and you want to scale horizontally then having separate indexes is a better approach. Using one big index will require re-sharding every time you hit vertical scaling limits.

khachik
  • 100 tenants is an optimistic value; I could reach 1000. I'd use more clusters because, with many customers, at some point it's cheaper to have several clusters rather than one big cluster (with BIG numbers, of course). I need to include the original index because if the tenant doesn't change any values in the products, nothing is copied into his index. I.e. the central index has 1M products, the tenant changes 1 product so his personal index has 1 product, while he still wants to see the other 999,999 products. The main index always stays more or less the same size. Tenants' indexes will grow quite slowly – drenda Jan 31 '19 at 23:16
  • because each day tenants would add a few products or edit a few "preset" products from the main index. – drenda Jan 31 '19 at 23:17
  • @drenda why do you think it is cheaper to have more clusters than one big cluster? ES is designed for horizontal scaling. – khachik Jan 31 '19 at 23:28
  • You might want to consider child-parent, but there are performance considerations: https://blog.trifork.com/2016/12/22/handling-a-massive-amount-of-product-variations-with-elasticsearch/ – khachik Jan 31 '19 at 23:30