
My use case is the following: I have continuously produced time-series data, plus one year of history. I want to index it into Elasticsearch in such a way that data is deleted after one year (according to the @timestamp field).

Data streams seem to be the perfect solution for the newly produced time-series data. Documents get indexed as soon as they are created, and ILM will delete the associated backing indices at the right time, one year later.

However, I'm stuck with the historical data. How can I index it so that it also gets deleted at the right time? Since rollover is based on the index age and not on the documents' @timestamp fields, all associated backing indices will likewise be deleted one year from now, even though they contain older data. In my case, this means the oldest historical data would remain in the cluster for two years, which is not the expected behaviour.

Do you have any ideas on how to fix this?

qcha

1 Answer


You can override this behavior by providing your own index.lifecycle.origination_date. The docs describe it as follows:

If specified, this is the timestamp used to calculate the index age for its phase transitions. Use this setting if you create a new index that contains old data and want to use the original creation date to calculate the index age. Specified as a Unix epoch value in milliseconds.

So you can index your old data into your data streams and for each backing index you can dynamically set the timestamp that should correspond to the date the index would have been created if that old historical data had been indexed back then.

PUT .ds-index-xxx/_settings
{
   "index.lifecycle.origination_date": 1577836800000
}

(1577836800000 is 2020-01-01T00:00:00Z expressed as Unix epoch milliseconds, which is the format the setting expects.)

You can find the max timestamp to use for each backing index using the following query:

POST index/_search
{
  "size": 0,
  "aggs": {
    "index": {
      "terms": {
        "field": "_index"
      },
      "aggs": {
        "date": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
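Putting the two steps together, here is a sketch of how to automate this (assumptions: Python with only the standard library, a cluster reachable at http://localhost:9200 without auth, and an illustrative data stream name — adjust to your environment). It runs the aggregation above and applies each backing index's max @timestamp as its origination date:

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumed cluster address

# The terms/max aggregation from above; terms size raised so data
# streams with many backing indices aren't truncated (default is 10).
QUERY = {
    "size": 0,
    "aggs": {
        "index": {
            "terms": {"field": "_index", "size": 1000},
            "aggs": {"date": {"max": {"field": "@timestamp"}}},
        }
    },
}


def origination_dates(agg_response):
    """Map each backing index name to its max @timestamp (epoch millis)."""
    buckets = agg_response["aggregations"]["index"]["buckets"]
    return {b["key"]: int(b["date"]["value"]) for b in buckets}


def es_request(method, path, body=None):
    """Minimal JSON-over-HTTP helper for the Elasticsearch REST API."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        ES + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def apply_origination_dates(data_stream):
    """Set index.lifecycle.origination_date on every backing index."""
    resp = es_request("POST", f"/{data_stream}/_search", QUERY)
    for index, millis in origination_dates(resp).items():
        es_request(
            "PUT",
            f"/{index}/_settings",
            {"index.lifecycle.origination_date": millis},
        )


# Example (hypothetical data stream name):
# apply_origination_dates("my-data-stream")
```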
Val
  • Thank you so much! I didn't know about this option. However, how do I set it "dynamically", as you say, in the case of a data stream? Ideally it would be the max of the @timestamp values contained in each backing index, but this parameter expects a hardcoded timestamp – qcha Jun 29 '23 at 09:42
  • I've updated my answer. You only have to do it once per index when indexing your old data – Val Jun 29 '23 at 09:52
  • Ok, but the date is not necessarily obvious to find for each backing index. Consider the case where rollover is based on the index size, say 50GB: each backing index should then have its own origination date, which would be the max/youngest of the @timestamp values it contains. Is there a way to extract it automatically and set it as the origination date? – qcha Jun 29 '23 at 10:03
  • A simple query with `"sort": {"@timestamp": "desc"}` on each backing index would give you that value. I've updated my answer with a query that gives you the max timestamp for all backing indexes – Val Jun 29 '23 at 11:24