I'm creating a Java multi-tenant application. Starting from a fairly small number of tenants (100), I began thinking about how to scale things up. In my application each tenant has a list of products. Each tenant can import products from a main HUGE list of 1 million records.
So, if every tenant imports all products, I would have an ES index of 100 million documents. Each document has 30 fields.
Because that huge list of products is the same for all tenants, I was thinking of not replicating the data for each tenant and instead having a central index with 1 million products to query directly.
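To make the size difference concrete, here is a minimal arithmetic sketch of the two layouts. The tenant count and catalog size come from the numbers above; `editedPerTenant` is a purely hypothetical figure, since I don't know yet how many products a tenant will actually modify.

```java
/** Rough document-count comparison of the two index layouts (inputs are illustrative). */
final class IndexSizing {

    /** Layout A: every tenant holds a full copy of the catalog. */
    static long replicatedDocs(long tenants, long catalogSize) {
        return Math.multiplyExact(tenants, catalogSize);
    }

    /** Layout B: one shared catalog plus only the products each tenant has edited. */
    static long sharedDocs(long tenants, long catalogSize, long editedPerTenant) {
        return Math.addExact(catalogSize, Math.multiplyExact(tenants, editedPerTenant));
    }

    public static void main(String[] args) {
        long tenants = 100;
        long catalogSize = 1_000_000;
        long editedPerTenant = 10_000; // hypothetical: not a measured value

        System.out.println("replicated: " + replicatedDocs(tenants, catalogSize));
        System.out.println("shared:     " + sharedDocs(tenants, catalogSize, editedPerTenant));
    }
}
```

With these inputs the replicated layout holds 100,000,000 documents and the shared layout 2,000,000. The saving shrinks as tenants edit more products.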
So, in the end I would have:
- One cluster for the main central product list
- One or more clusters for the tenants' indexes
When a tenant wants to search for a product, a cross-cluster query would be performed (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html) to find "on the fly" all products from the main index + products from the tenant's index.
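As a sketch of what such a request would target: in cross-cluster search a remote index is addressed as `<cluster_alias>:<index>`, and it can be combined with local index names in one comma-separated expression. The example assumes the search is sent to the tenant cluster and that the central cluster is registered there as a remote. The alias `central` and the index names `products` and `tenant_42_products` are hypothetical.

```java
/** Builds the index expression for one search over the central and the tenant index. */
final class SearchTargets {

    /**
     * @param centralAlias alias under which the central cluster is registered as a remote (assumed)
     * @param centralIndex name of the shared product index on the central cluster (assumed)
     * @param tenantIndex  name of the tenant's local index (assumed)
     */
    static String indexExpression(String centralAlias, String centralIndex, String tenantIndex) {
        // Remote indexes use "<alias>:<index>"; local indexes use the plain name.
        return centralAlias + ":" + centralIndex + "," + tenantIndex;
    }

    /** REST path of the search endpoint for the given index expression. */
    static String searchPath(String indexExpression) {
        return "/" + indexExpression + "/_search";
    }

    public static void main(String[] args) {
        String expr = indexExpression("central", "products", "tenant_42_products");
        System.out.println(searchPath(expr));
    }
}
```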
When a tenant wants to edit a product from the main index, that product is also copied into the tenant's index.
So here another problem arises: I need to remove duplicates (the modified product is still the same as the one in the main index, with just a few changes such as the price). How can I do that? I could use an aggregation as described here: https://stackoverflow.com/a/29886871/2012635
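For comparison, the deduplication could also be done on the client after the hits come back. This is not the aggregation approach from the linked answer, only a minimal sketch of the rule I want: when the same product appears in both indexes, the tenant's copy wins. The `Hit` class and its fields are hypothetical stand-ins for whatever the real search response is mapped to.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Client-side deduplication of hits coming from the central and the tenant index. */
final class ProductMerge {

    /** Hypothetical minimal view of one search hit. */
    static final class Hit {
        final String productId;        // shared between the central product and its tenant copy
        final boolean fromTenantIndex; // true if the hit came from the tenant's index
        final double price;

        Hit(String productId, boolean fromTenantIndex, double price) {
            this.productId = productId;
            this.fromTenantIndex = fromTenantIndex;
            this.price = price;
        }
    }

    /** Keeps one hit per product id, preferring the tenant's copy; first-seen order is kept. */
    static List<Hit> dedupe(List<Hit> hits) {
        Map<String, Hit> byId = new LinkedHashMap<>();
        for (Hit hit : hits) {
            Hit existing = byId.get(hit.productId);
            if (existing == null || (hit.fromTenantIndex && !existing.fromTenantIndex)) {
                byId.put(hit.productId, hit);
            }
        }
        return new ArrayList<>(byId.values());
    }
}
```

The obvious limitation is that this only sees the hits of one response page, so page sizes and aggregation counts computed by Elasticsearch would still include the duplicates.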
So finally, my questions are:
- Is modeling this as ONE BIG central index better than having 100 BIG indexes? I should save money because I would store far less data.
- Is the cross-cluster query too expensive, considering that I also have to use an aggregation to remove duplicates?
- Is there a better approach for my requirements?
A typical search query I have to perform looks like this:
{
  "bool" : {
    "filter" : [
      {
        "bool" : {
          "must" : [
            {
              "bool" : {
                "must" : [
                  {
                    "range" : {
                      "sphereMin" : {
                        "from" : "-17",
                        "to" : null,
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  },
                  {
                    "range" : {
                      "sphereMax" : {
                        "from" : null,
                        "to" : "5",
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  }
                ],
                "adjust_pure_negative" : true,
                "boost" : 1.0
              }
            }
          ],
          "should" : [
            {
              "range" : {
                "sphereMin" : {
                  "from" : null,
                  "to" : "-17",
                  "include_lower" : true,
                  "include_upper" : true,
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "must" : [
            {
              "bool" : {
                "must" : [
                  {
                    "range" : {
                      "sphereMax" : {
                        "from" : "-17",
                        "to" : null,
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  },
                  {
                    "range" : {
                      "sphereMax" : {
                        "from" : null,
                        "to" : "5",
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  }
                ],
                "adjust_pure_negative" : true,
                "boost" : 1.0
              }
            }
          ],
          "should" : [
            {
              "range" : {
                "sphereMax" : {
                  "from" : "5",
                  "to" : null,
                  "include_lower" : true,
                  "include_upper" : true,
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}
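Read as a plain predicate, the filter above seems to reduce to the sketch below. Two things are assumptions here: that `sphereMin` and `sphereMax` are mapped as numeric fields (the mapping is not shown), and my understanding that a `should` clause placed next to a `must` clause is optional by default, so inside a filter context it does not change which documents match. The `should` ranges are therefore left out.

```java
/** Plain-Java reading of the two filter clauses of the query (sketch, numeric fields assumed). */
final class SphereFilter {

    static boolean matches(double sphereMin, double sphereMax) {
        // First filter clause: sphereMin >= -17 AND sphereMax <= 5 (both bounds inclusive).
        boolean first = sphereMin >= -17 && sphereMax <= 5;
        // Second filter clause: -17 <= sphereMax <= 5 (both bounds inclusive).
        boolean second = sphereMax >= -17 && sphereMax <= 5;
        // All clauses under "filter" must match.
        return first && second;
    }

    public static void main(String[] args) {
        System.out.println(matches(-10, 0)); // inside both ranges
        System.out.println(matches(-20, 0)); // sphereMin below -17
    }
}
```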
I also have some aggregations and a matchQuery on a field analyzed with an edge_ngram filter.