Suppose I have Elasticsearch indexes in the following order:

index-2022-04
index-2022-05
index-2022-06
...

index-2022-04 represents the data stored in the month of April 2022, index-2022-05 represents the data stored in the month of May 2022, and so on. Now let's say in my query payload, I have the following timestamp range:

"range": {
    "timestampRange": {
        "gte": "2022-04-05T01:00:00.708363",  
        "lte": "2022-06-06T23:00:00.373772"                 
    }
}

The range above says that I want the data between the 5th of April and the 6th of June. That means I have to query three indexes: index-2022-04, index-2022-05 and index-2022-06. Is there a simple and efficient way to run this query across those three indexes without querying each index one by one?

I am using Python to handle the query, and I am aware that I can query across different indexes at the same time (see this SO post). Any tips or pointers would be helpful, thanks.

buddemat
gusgus

2 Answers

You simply need to define an alias over your indices and query the alias instead of the indexes and let ES figure out which underlying indexes it needs to visit.
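
A minimal sketch of how such an alias could be defined from Python. The alias name index-all and both helper names are hypothetical; the action body follows the Elasticsearch aliases API, and the keyword form of update_aliases assumes the 8.x Python client (the 7.x client takes body= instead).

```python
def build_alias_actions(index_pattern, alias_name):
    """Build the action list for the aliases API.

    A wildcard in "index" is resolved when the action runs, so indices
    created later are not picked up automatically.
    """
    return [{"add": {"index": index_pattern, "alias": alias_name}}]


def create_alias(es_client, index_pattern="index-*", alias_name="index-all"):
    """Register the alias.

    es_client is assumed to be an elasticsearch.Elasticsearch instance
    (8.x client); "index-all" is a made-up alias name.
    """
    actions = build_alias_actions(index_pattern, alias_name)
    return es_client.indices.update_aliases(actions=actions)
```

After that, the search call would target the alias (index='index-all' in this sketch) instead of the individual monthly indexes. Because the wildcard is resolved once, each new monthly index would still have to be added to the alias, for example through an index template.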

Optionally, for better search performance, you can also configure index-time sorting on timestampRange, so that if your alias spans a full year of indexes, ES knows to visit only three of them based on the range constraint in your query (2022-04-05 -> 2022-06-06).
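
As a rough sketch, index-time sorting is declared in the index settings when an index is created; it cannot be switched on for an existing index, so already-populated monthly indexes would need to be reindexed. The helper names are hypothetical, the date mapping for timestampRange is an assumption about the data, and the keyword arguments to indices.create assume the 8.x Python client.

```python
def build_sorted_index_body(sort_field="timestampRange", sort_order="desc"):
    """Return settings and mappings for an index sorted on sort_field at index time."""
    return {
        "settings": {
            "index": {
                "sort.field": sort_field,
                "sort.order": sort_order,
            }
        },
        # Assumption: the field holds a single timestamp and is mapped as a date.
        "mappings": {
            "properties": {
                sort_field: {"type": "date"},
            }
        },
    }


def create_sorted_index(es_client, index_name):
    """Create index_name with index sorting enabled (8.x client signature assumed)."""
    body = build_sorted_index_body()
    return es_client.indices.create(
        index=index_name,
        settings=body["settings"],
        mappings=body["mappings"],
    )
```

Putting the same settings into an index template that matches index-* would apply them to every new monthly index without a per-index call.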

Val

As you wrote, you can simply use a wildcard and/or pass a list as the target index.

The simplest way would be to just query all of your indices with an asterisk wildcard (e.g. index-* or index-2022-*) as the target. You do not need to define an alias for that; you can use the wildcard directly in the target string, like so:

from elasticsearch import Elasticsearch

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

# Search every index matching the wildcard pattern; the range clause
# restricts the hits to the requested time window.
result = es_client.search(
    index='index-*',
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)

This will query all indices that match the pattern, but I would expect Elasticsearch to perform some sort of optimization on this. As @Val wrote in his answer, configuring index-time sorting will be beneficial for performance, as it limits the number of documents that have to be visited when the index sort and the search sort are the same.
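
One way to check whether Elasticsearch actually skips anything is to look at the _shards section of the search response, which reports a skipped count next to total. The helper below is a hypothetical sketch that only reads those two numbers from a response. Whether shards get skipped depends on the cluster and the request (the pre_filter_shard_size search parameter influences this), so a skipped count of 0 does not by itself mean the query was expensive.

```python
def shard_summary(response):
    """Summarize how many shards a search touched.

    response is the search result (a dict-like object); the "skipped"
    entry is read defensively in case it is absent.
    """
    shards = response["_shards"]
    total = shards["total"]
    skipped = shards.get("skipped", 0)
    return {"total": total, "skipped": skipped, "visited": total - skipped}
```

With the client from the example above, this would be called as shard_summary(result).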

For completeness' sake, if you really wanted to pass just the relevant index names to Elasticsearch, another option would be to first work out on the Python side which indices you need to query and supply these as a list (e.g. ['index-2022-04', 'index-2022-05', 'index-2022-06']) as the target. You could, for example, use the pandas date_range() function to generate such a list of indices, like so:

from elasticsearch import Elasticsearch
import pandas as pd

es_client = Elasticsearch('https://elastic.host:9200')

datestring_start = '2022-04-05T01:00:00.708363'
datestring_end = '2022-06-06T23:00:00.373772'

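# Snap the start date to the first day of its month, then step month by month
# ('MS' = month start) up to the end date and format each step as an index name.
first_month = pd.to_datetime(datestring_start).to_period('M').to_timestamp()
months_list = (
    pd.date_range(first_month, datestring_end, freq='MS')
    .strftime("index-%Y-%m")
    .tolist()
)

# Pass the list of index names as the search target.
result = es_client.search(
    index=months_list,
    query={
        "bool": {
            "must": [{
                "range": {
                    "timestampRange": {
                        "gte": datestring_start,
                        "lte": datestring_end
                    }
                }
            }]
        }
    }
)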
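
If pulling in pandas only for this feels heavy, the same list can be built with the standard library. This is a sketch under the assumption that the indexes follow the index-YYYY-MM naming from the question; the function name monthly_index_names is made up for illustration.

```python
from datetime import datetime


def monthly_index_names(start_iso, end_iso, prefix="index-"):
    """Return one index name per calendar month touched by [start_iso, end_iso]."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)

    names = []
    year, month = start.year, start.month
    # Walk month by month, rolling over into the next year after December.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


print(monthly_index_names('2022-04-05T01:00:00.708363',
                          '2022-06-06T23:00:00.373772'))
# ['index-2022-04', 'index-2022-05', 'index-2022-06']
```

If one of the generated names might not exist as an index, passing ignore_unavailable=True to the search call should keep the request from failing on the missing one.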
buddemat
  • Thanks for the answer, would specifying a wildcard in your first approach affect the search speed/performance of the query? I was also leaning towards your second option before writing this question, but was wondering if there was an alternative option - hence this question – gusgus Apr 19 '22 at 13:07
  • It's not necessary to "figure out" anything on the Python side or to use a wildcard either. An alias (which exists for a reason) + index-time sorting, as I suggested in my answer, is all that is needed to make it work flawlessly. – Val Apr 19 '22 at 13:20
  • @Val: I agree, at least on the index-time sorting part. But defining an alias with a wildcard (which is what would be needed here, right?) is no different from directly using a wildcard in the first place, is it? As for the alternative, I only added that because the question specifically asked for a way to pass the three index names to Elasticsearch. I'll update my answer to make that clearer. – buddemat Apr 20 '22 at 13:39
  • Aliases are a bit more than just a wildcarded index. First, you can add filters to them (to only provide a view on the data) and second, you can reindex/reorganize your indexes without the user noticing anything, by continuing to query the alias without having to worry about how the indexes are named/organized underneath – Val Apr 20 '22 at 14:03
  • That's true, though not directly relevant in this specific case. But your answer is very nice and I was not completely aware of the index time sorting, so you have my upvote. Cheers! – buddemat Apr 20 '22 at 21:54
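
To illustrate the filter point from the comments: an alias action can carry a filter, so that searches through the alias only see matching documents. The sketch below only builds the action; the helper name and the alias name index-2022-q2 are hypothetical, and the range filter reuses the timestampRange field from the question.

```python
def build_filtered_alias_action(index_pattern, alias_name, field, gte, lte):
    """Build an aliases-API "add" action whose alias exposes only documents in [gte, lte]."""
    return {
        "add": {
            "index": index_pattern,
            "alias": alias_name,
            "filter": {"range": {field: {"gte": gte, "lte": lte}}},
        }
    }


action = build_filtered_alias_action(
    "index-2022-*",
    "index-2022-q2",  # made-up alias name
    "timestampRange",
    "2022-04-01T00:00:00",
    "2022-06-30T23:59:59",
)
```

The resulting action would be sent in the actions list of the aliases API, in the same way as an unfiltered alias.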