1

I'm building an application that periodically queries system resource usage and records the data in Elasticsearch. I eventually want to show this information as a graph for a given time period. Users will generally want to view statistics for one of a few set time periods:

  • The current day
  • The current month
  • The current year

Because of this, I've been trying to work out the most efficient way to store the data in Elasticsearch for fast searches. Each entry has its own DateTime field (down to the millisecond), but I expect searches to be much faster if a query only has to touch specific indices.

My plan is to name each index after the current day (e.g. 2014_04_09). According to this, you can link multiple indices to a single alias. In this case, I would add the daily index to an alias for the month (2014_04) as well as one for the year (2014). The idea is that a search against the 2014_04 alias will automatically search all of the individual daily indices in April. Will this work and, if so, is it optimal?
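To make the scheme concrete, here is a minimal sketch of how the daily index name and the matching alias actions could be generated. It only builds the JSON body that would be sent to the `_aliases` endpoint; the helper names `daily_index_name` and `alias_actions_for_day` are made up for illustration, and nothing here talks to a cluster.

```python
from datetime import date


def daily_index_name(day: date) -> str:
    """Return the per-day index name, e.g. 2014_04_09 (hypothetical naming scheme)."""
    return day.strftime("%Y_%m_%d")


def alias_actions_for_day(day: date) -> dict:
    """Build an _aliases request body that adds the daily index
    to a month alias (2014_04) and a year alias (2014)."""
    index = daily_index_name(day)
    return {
        "actions": [
            {"add": {"index": index, "alias": day.strftime("%Y_%m")}},
            {"add": {"index": index, "alias": day.strftime("%Y")}},
        ]
    }


# Body for the index created on 9 April 2014.
body = alias_actions_for_day(date(2014, 4, 9))
```

A body like this would have to be posted once for every new daily index, which is the maintenance cost of the approach.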

Has anyone else had a similar experience with DateTime queries in ElasticSearch? Thanks!

Stephen
  • Is there any particular reason you want different indices per day? You could also index the datetime as a field on each document, and use a `range filter` in your queries (since the filter will be cached, the performance should be pretty good that way). – Chris Apr 09 '14 at 20:41
  • @Chris - there's no specific reason for me to use different indices for every day - I'm just trying to work out the best way to store millions of data entries. If the `range filter` is indeed cached, that seems like it would work well. Presumably the filter results get updated as stuff is pushed into ElasticSearch? – Stephen Apr 09 '14 at 21:06

2 Answers

2

I would read this article in full for more insight; it touches on Elasticsearch and timestamped data. Hope this helps.

Nathan Smith
1

As you are kind of getting at in your comments, it makes a lot more sense to combine these into one index because it's the same information and it will make future queries much simpler.

If you make daily, monthly, and yearly indices, you will either have to index every document three times or come up with complicated logic to control the aliasing. In my opinion that is not worth it, and it creates a huge number of indices (one per day). If you are doing this for logging, then Logstash will obviously be a better answer, as noted by Nate. It's probably worth noting in that case that you can close indices when they are not providing any value, so that they have no negative impact beyond taking up disk space.

Logging aside, creating N indices inherently results in at least N shards. Too many shards will unnecessarily slow things down when a single one would suffice. The aliasing approach also creates recurring maintenance work, because every new index has to be added to its aliases.

By combining these into one index, you can easily perform analytics on demand with high performance, and you can more easily scale Elasticsearch across multiple nodes when the time comes. You will almost certainly want more complicated aggregations down the road, and they will likely benefit from the simpler indexing.
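As a rough illustration of the single-index approach, the sketch below builds a search body that restricts documents with a range filter and buckets them with a `date_histogram` aggregation. It assumes the Elasticsearch 1.x query syntax that was current for this thread (the `filtered` query was later replaced by a `bool` query with a `filter` clause), and the field names `timestamp` and `cpu_percent` are invented for the example.

```python
from datetime import datetime


def usage_histogram_query(start: datetime, end: datetime, interval: str) -> dict:
    """Build a search body: range filter on the timestamp, then one bucket
    per interval with the average of a (hypothetical) cpu_percent field."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "filtered": {
                "filter": {
                    "range": {
                        "timestamp": {
                            "gte": start.isoformat(),
                            "lt": end.isoformat(),
                        }
                    }
                }
            }
        },
        "aggs": {
            "usage_over_time": {
                "date_histogram": {"field": "timestamp", "interval": interval},
                "aggs": {"avg_cpu": {"avg": {"field": "cpu_percent"}}},
            }
        },
    }


# Daily buckets for April 2014; the same function covers a day or a year
# by changing the bounds and the interval.
april = usage_histogram_query(datetime(2014, 4, 1), datetime(2014, 5, 1), "day")
```

The day, month, and year views from the question then differ only in the bounds and the interval passed in, with no extra indices or aliases to maintain.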

You will receive updates that apply to your filter as they come in even if it is cached. This can be easily proven by generating a simple filter, running it, and then adding something else within its expected result set.

pickypg
  • Thank you for the thoughtful reply! If I go with a range filter, how much additional space (RAM) is this going to take up? I could potentially have millions of records for several years' worth of data. – Stephen Apr 11 '14 at 17:04
  • Also, what is the difference between a `range filter` and an `index alias` with a filter applied to it? – Stephen Apr 11 '14 at 18:05
  • The question about RAM really depends on the amount of data, but the [default filter cache size is 20% of your node's memory](http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/index-modules-cache.html). The difference between the two filters is that the range filter will cache results of the query, while the index alias' filter will cache a bunch of indexes to use, which would be useful, but nowhere near the same performance benefit (there's nothing stopping you from using both, but obviously some combinations make less sense if you are already using the aliased index approach). – pickypg Apr 12 '14 at 19:04
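For comparison with a range filter sent on each query, an alias with a filter attached is defined once through the `_aliases` endpoint and then searched like an index. A minimal sketch of such a body follows; the index name `stats`, the alias name, and the `timestamp` field are assumptions made for illustration.

```python
def filtered_alias_action(index: str, alias: str, field: str, start: str, end: str) -> dict:
    """Build an _aliases request body that adds an alias whose searches
    only see documents with `field` in the half-open range [start, end)."""
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": alias,
                    "filter": {"range": {field: {"gte": start, "lt": end}}},
                }
            }
        ]
    }


# An alias over a single (hypothetical) "stats" index that only exposes April 2014.
april_alias = filtered_alias_action("stats", "stats_2014_04", "timestamp", "2014-04-01", "2014-05-01")
```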