I'm new to Elastic. I'm attempting to do a proof-of-concept for professional reasons. So far I'm very impressed. I've indexed a bunch of data and have run a few queries - almost all of which are super fast (thumbs up).

The only issue I'm encountering is that my date range query seems relatively slow compared to all my other queries. We're talking 1000ms+ compared to <100ms for everything else.

I am using the NEST .NET library.

My document structure looks like this:

{ 
   "tourId":"ABC123",
   "tourName":"Super cool tour",
   "duration":12,
   "countryCode":"MM",
   "regionCode":"AS",
   ...
   "availability":[ 
      { 
         "startDate":"2021-02-01T00:00:00",
         ...
      },
      { 
         "startDate":"2021-01-11T00:00:00",
         ...
      }
   ]
}

I'm trying to get all tours which have availability within a certain month. I am using a date range to do this. I'm not sure if there's a more efficient way to do this? Please let me know if so.

I have tried the following query:

var response = await elastic.SearchAsync<Tour>(s => s
    .Query(q => q
        .Nested(n => n
            .Path(p => p.Availability)
            .Query(nq => nq
                .DateRange(r => r
                    .Field(f => f.Availability.First().StartDate)
                    .GreaterThanOrEquals(new DateTime(2020, 07, 01))
                    .LessThan(new DateTime(2020, 08, 01))
                )
            )
        )
    )
    .Size(20)
    .Source(src => src.IncludeAll().Excludes(e => e.Fields(f => f.Availability)))
);

I basically followed the example on their documentation here: https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/writing-queries.html#structured-search but I'm not sure that this is the best way for me to achieve this. Is it just that a date range is naturally slower than other queries or am I just doing it wrong?!

EDIT:

I tried adding a new field named YearMonth, which is just an integer representing the year and month of each availability in the format yyyyMM, and querying against that. The timing was also around one second. This makes me wonder whether it's not actually an issue with the date but something else entirely.

I have run a profiler on my query and the result is below. I have no idea what most of it means so if someone does and can give me some help that'd be great:

Query:

var response = await elastic.SearchAsync<Tour>(s => s
    .Query(q => q
        .Nested(n => n
            .Path(p => p.Availability)
            .Query(nq => nq
                .Term(t => t
                    .Field(f => f.Availability.First().YearMonth)
                    .Value(202007)
                )
            )
        )
    )
    .Size(20)
    .Source(src => src.IncludeAll().Excludes(e => e.Fields(f => f.Availability)))
    .Profile()
);

Profile:

{ 
   "Shards":[ 
      { 
         "Aggregations":[ 

         ],
         "Id":"[pr4Os3Y7RT-gXRWR0gxoEQ][tours][0]",
         "Searches":[ 
            { 
               "Collector":[ 
                  { 
                     "Children":[ 
                        { 
                           "Children":[ 

                           ],
                           "Name":"SimpleTopDocsCollectorContext",
                           "Reason":"search_top_hits",
                           "TimeInNanoseconds":6589867
                        }
                     ],
                     "Name":"CancellableCollector",
                     "Reason":"search_cancelled",
                     "TimeInNanoseconds":13981165
                  }
               ],
               "Query":[ 
                  { 
                     "Breakdown":{ 
                        "Advance":5568,
                        "BuildScorer":2204354,
                        "CreateWeight":25661,
                        "Match":0,
                        "NextDoc":3650375,
                        "Score":3795517
                     },
                     "Children":null,
                     "Description":"ToParentBlockJoinQuery (availability.yearMonth:[202007 TO 202007])",
                     "TimeInNanoseconds":9686512,
                     "Type":"ESToParentBlockJoinQuery"
                  }
               ],
               "RewriteTime":36118
            }
         ]
      }
   ]
}
Pieterjan
Andy Furniss

2 Answers

1

This seems like a data structure optimisation issue. Without changing too much, you could convert all your availability dates into Unix timestamps and then use a range query (quick conversion tips in C# can be found here).
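As a rough sketch of the timestamp idea, the helper below converts a date to Unix epoch seconds and computes the half-open bounds for one month. It assumes the availability dates are UTC, and the class and method names are made up for illustration; the range query itself would still be sent through NEST as in the question.

```csharp
using System;

public static class UnixMonthRange
{
    // Converts a DateTime to Unix epoch seconds.
    // Assumption: the stored value represents a UTC instant.
    public static long ToUnixSeconds(DateTime value)
    {
        var utc = DateTime.SpecifyKind(value, DateTimeKind.Utc);
        return new DateTimeOffset(utc).ToUnixTimeSeconds();
    }

    // Returns the half-open range [Start, End) in epoch seconds for a month,
    // matching the GreaterThanOrEquals / LessThan pair used in the question.
    public static (long Start, long End) ForMonth(int year, int month)
    {
        var start = new DateTime(year, month, 1, 0, 0, 0, DateTimeKind.Utc);
        var end = start.AddMonths(1);
        return (ToUnixSeconds(start), ToUnixSeconds(end));
    }
}
```

For example, UnixMonthRange.ForMonth(2020, 7) gives Start = 1593561600 and End = 1596240000.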

Another option is to create monthly (or weekly or yearly, depending on your data) indices and, before executing your query, narrow the search to only the indices you need. This would mean putting the same listing into multiple indices (duplicating documents across indices), depending on the availability month/day.

Separating timestamp (time-series) data per certain index granularity is a common practice in ES. More info here.

The latter would mean that you would filter on a DateTime field rather than an array of timestamps.
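To make the per-month index idea concrete, here is a small sketch that works out which indices a tour would be written to and which one a monthly search would target. The "tours-yyyy.MM" naming scheme and the helper names are assumptions for illustration, not anything Elasticsearch or NEST prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class MonthlyIndices
{
    // Hypothetical naming scheme: one index per month, e.g. "tours-2020.07".
    public static string IndexNameFor(DateTime date) =>
        "tours-" + date.ToString("yyyy.MM", CultureInfo.InvariantCulture);

    // Distinct indices a tour would be duplicated into:
    // one per month in which it has at least one availability.
    public static IReadOnlyList<string> IndicesFor(IEnumerable<DateTime> startDates) =>
        startDates
            .Select(IndexNameFor)
            .Distinct()
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
}
```

With the two start dates from the sample document (2021-02-01 and 2021-01-11), IndicesFor returns "tours-2021.01" and "tours-2021.02"; a search for July 2020 would only need to hit "tours-2020.07".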

I'd personally go with the second option.

nlv
  • "You're doing a .First() which I think means that only the first element from your dates array will be filtered." - this is not true; the _expression_ passed to `.Field()` is used to build a string path to the field on the model by visiting the expression. – Russ Cam Oct 13 '19 at 23:09
  • You're absolutely right, overlooked that the array has/would have other named objects. Will remove unnecessary comment. – nlv Oct 14 '19 at 07:38
  • Thanks for your reply Neil. Your suggestions sound good and I will look into them for extra efficiency. However, after some further fiddling, I've discovered that I was actually encountering the standard delay on the first query when using NEST (https://stackoverflow.com/q/44725584/5392786). I changed the order of my queries around and discovered that the first one always takes around a second, regardless of how simple the query actually is. The month query I was trying was <100ms if it wasn't the first to run! – Andy Furniss Oct 14 '19 at 10:42
1

The slowness issue you discovered in date range queries is an interesting and complex one.

Let's start with your comment "the first [query] always takes around a second". Elasticsearch (ES) queries are cached internally by nodes. If you perform a query for the very first time, the result for the query is not cached in a node's heap yet, and the node has to generate the result for the first time. That's why your first query takes longer, while subsequent queries with the same structure perform better.
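One way to check the first-versus-later difference is to time the same call several times in a row. The sketch below is only a measuring harness: runQuery is a hypothetical stand-in for the elastic.SearchAsync<Tour>(...) call from the question, and the helper name is made up.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class QueryTiming
{
    // Runs the same operation 'runs' times and returns each run's elapsed milliseconds.
    // 'runQuery' is a hypothetical stand-in for the search call being measured.
    public static async Task<long[]> MeasureAsync(Func<Task> runQuery, int runs)
    {
        var timings = new long[runs];
        for (var i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            await runQuery();
            stopwatch.Stop();
            timings[i] = stopwatch.ElapsedMilliseconds;
        }
        return timings;
    }
}
```

If only the first entry is large, the cost is a one-off rather than something inherent to the query. The harness cannot tell by itself whether that one-off comes from the client or from the node, so it is worth comparing against the time the server reports for the request.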

However, if you run a profiler, the cache is deactivated and all queries with the same structure should take more or less the same time. Nonetheless, you're likely to see date range queries run significantly slower than other queries.

The core reason for the slowness of some date range queries in Elasticsearch seems to be related to the caching behavior of the nodes.

Quoting from an Elasticsearch discussion about the same problem:

When the range of the query covers the whole shard, a exists query is done instead, and it turns out these can be quite slow, sadly

exists is so slow and is done when a queried range encompasses a shard's range

So basically exists is so slow because it spends time caching all the documents (since they all have the timestamp field)

According to an Elastic team member, the fix for this issue (disabling parts of the query cache) will be released with Lucene 8.6.

Source: https://discuss.elastic.co/t/time-range-query-performance-7-6/223194/14

Jay