I'm new to the Elasticsearch domain. I'm learning it and trying it out to check whether it meets my needs.

Right now I'm learning aggregations in elasticsearch and I wrote the following python script to ingest some time-series data into elasticsearch.

Every 5 seconds I create a new message which will have:

  1. Timestamp (ISO8601 format)
  2. Counter
  3. A random number between 0 and 100

For every new day, I create a new index named logs_Y-m-d (e.g. logs_2016-11-03).

I index every message using the message's Counter as the _id. The counter resets for every new index (i.e. every day).


import csv
import time
import random
from datetime import datetime
from elasticsearch import Elasticsearch


class ElasticSearchDB:
    def __init__(self):
        self.es = Elasticsearch()

    def run(self):
        print("Started: {}".format(datetime.now().isoformat()))
        print("<Ctrl + c> for exit!")

        with open("..\\out\\logs.csv", "w", newline='') as f:
            writer = csv.writer(f)
            counter = 0
            try:
                while True:
                    i_name = "logs_" + time.strftime("%Y-%m-%d")
                    if not self.es.indices.exists([i_name]):
                        self.es.indices.create(i_name, ignore=400)
                        print("New index created: {}".format(i_name))
                        counter = 0

                    message = {"counter": counter, "@timestamp": datetime.now().isoformat(), "value": random.randint(0, 100)}
                    # Write to file
                    writer.writerow(message.values())
                    # Write to elasticsearch index
                    self.es.index(index=i_name, doc_type="logs", id=counter, body=message)
                    # Waste some time
                    time.sleep(5)
                    counter += 1

            except KeyboardInterrupt:
                print("Stopped: {}".format(datetime.now().isoformat()))


if __name__ == "__main__":
    test_es = ElasticSearchDB()
    test_es.run()

I ran this script for 30 minutes. Then, using Sense, I queried Elasticsearch with the following aggregation queries.

Query #1: Get all documents.

Query #2: Aggregate logs from the last 1 hour and generate stats for them. This returns the correct results.

Query #3: Aggregate logs from the last 1 minute and generate stats for them. The number of docs aggregated is the same as in the 1-hour aggregation; ideally it should have aggregated only 12-13 logs.

Query #4: Aggregate logs from the last 15 seconds and generate stats for them. The number of docs aggregated is the same as in the 1-hour aggregation; ideally it should have aggregated only 3-4 logs.

My Questions:

  1. Why is Elasticsearch not able to understand the 1-minute and 15-second ranges?
  2. I understand mappings, but I don't know how to write one, so I haven't written one. Is that what is causing this problem?

Please help!


Query #1: Get all

GET /_search

Output:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 314,
      "max_score": 1,
      "hits": [
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "19",
            "_score": 1,
            "_source": {
               "counter": 19,
               "value": 62,
               "@timestamp": "2016-11-03T07:40:35.981395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "22",
            "_score": 1,
            "_source": {
               "counter": 22,
               "value": 95,
               "@timestamp": "2016-11-03T07:40:51.066395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "25",
            "_score": 1,
            "_source": {
               "counter": 25,
               "value": 18,
               "@timestamp": "2016-11-03T07:41:06.140395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "26",
            "_score": 1,
            "_source": {
               "counter": 26,
               "value": 58,
               "@timestamp": "2016-11-03T07:41:11.164395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "29",
            "_score": 1,
            "_source": {
               "counter": 29,
               "value": 73,
               "@timestamp": "2016-11-03T07:41:26.214395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "41",
            "_score": 1,
            "_source": {
               "counter": 41,
               "value": 59,
               "@timestamp": "2016-11-03T07:42:26.517395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "14",
            "_score": 1,
            "_source": {
               "counter": 14,
               "value": 9,
               "@timestamp": "2016-11-03T07:40:10.857395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "40",
            "_score": 1,
            "_source": {
               "counter": 40,
               "value": 9,
               "@timestamp": "2016-11-03T07:42:21.498395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "24",
            "_score": 1,
            "_source": {
               "counter": 24,
               "value": 41,
               "@timestamp": "2016-11-03T07:41:01.115395"
            }
         },
         {
            "_index": "logs_2016-11-03",
            "_type": "logs",
            "_id": "0",
            "_score": 1,
            "_source": {
               "counter": 0,
               "value": 79,
               "@timestamp": "2016-11-03T07:39:00.302395"
            }
         }
      ]
   }
}

Query #2: Get stats from last 1 hour.

GET /logs_2016-11-03/logs/_search?search_type=count
{
    "aggs": {
        "time_range": {
            "filter": {
                "range": {
                    "@timestamp": {
                        "from": "now-1h"
                    }
                }
            },
            "aggs": {
                "just_stats": {
                    "stats": {
                        "field": "value"
                    }
                }
            }
        }
    }
}

Output:

{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 366,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "time_range": {
         "doc_count": 366,
         "just_stats": {
            "count": 366,
            "min": 0,
            "max": 100,
            "avg": 53.17213114754098,
            "sum": 19461
         }
      }
   }
}

I get 366 entries, which is correct.

Query #3: Get stats from last 1 minute.

GET /logs_2016-11-03/logs/_search?search_type=count
{
    "aggs": {
        "time_range": {
            "filter": {
                "range": {
                    "@timestamp": {
                        "from": "now-1m"
                    }
                }
            },
            "aggs": {
                "just_stats": {
                    "stats": {
                        "field": "value"
                    }
                }
            }
        }
    }
}

Output:

{
   "took": 15,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 407,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "time_range": {
         "doc_count": 407,
         "just_stats": {
            "count": 407,
            "min": 0,
            "max": 100,
            "avg": 53.152334152334156,
            "sum": 21633
         }
      }
   }
}

This is wrong: there can't be 407 entries in the last 1 minute; it should have aggregated only 12-13 logs.

Query #4: Get stats from last 15 seconds.

GET /logs_2016-11-03/logs/_search?search_type=count
{
    "aggs": {
        "time_range": {
            "filter": {
                "range": {
                    "@timestamp": {
                        "from": "now-15s"
                    }
                }
            },
            "aggs": {
                "just_stats": {
                    "stats": {
                        "field": "value"
                    }
                }
            }
        }
    }
}

Output:

{
   "took": 15,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 407,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "time_range": {
         "doc_count": 407,
         "just_stats": {
            "count": 407,
            "min": 0,
            "max": 100,
            "avg": 53.152334152334156,
            "sum": 21633
         }
      }
   }
}

This is also wrong: there can't be 407 entries in the last 15 seconds; it should have aggregated only 3-4 logs.

HaggarTheHorrible

1 Answer

Your query is right, but ES stores dates in UTC, and hence you are getting everything back. From the documentation:

In JSON documents, dates are represented as strings. Elasticsearch uses a set of preconfigured formats to recognize and parse these strings into a long value representing milliseconds-since-the-epoch in UTC.

You could use the pytz module and store dates in UTC in ES. Refer to this SO question.
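As a minimal sketch of that idea, using only the standard library's datetime.timezone instead of pytz (the utc_timestamp helper name is mine, not from the original script):

```python
from datetime import datetime, timezone

# Build the @timestamp value in UTC so that Elasticsearch date math
# (now-1m, now-15s, ...) compares against the same clock the documents use.
def utc_timestamp():
    # isoformat() on an aware datetime appends the +00:00 offset,
    # which Elasticsearch's default date parser accepts.
    return datetime.now(timezone.utc).isoformat()

# Drop-in replacement for the datetime.now().isoformat() call in the script:
message = {"counter": 0, "@timestamp": utc_timestamp(), "value": 42}
print(message["@timestamp"])
```

With timestamps stored in UTC, the original now-1m and now-15s range filters should match only the expected handful of documents.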

You could also use the time_zone param in the range query. It is also better to aggregate on filtered results rather than fetch all the results and then filter them.

GET /logs_2016-11-03/logs/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "2016-11-03T07:15:35",         <----- You would need absolute value
            "time_zone": "-01:00"              <---- timezone setting
          }
        }
      }
    }
  },
  "aggs": {
    "just_stats": {
      "stats": {
        "field": "value"
      }
    }
  },
  "size": 0
}

You would have to convert the desired time (now-1m, now-15s) to the format yyyy-MM-dd'T'HH:mm:ss for the time_zone param to work, since now is not affected by time_zone, so the best option is to convert dates to UTC before storing them.
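A sketch of that conversion step (the absolute_from helper name is mine; it assumes you want the absolute cutoff computed in UTC):

```python
from datetime import datetime, timedelta

# Turn a relative window like "now-1m" into an absolute
# yyyy-MM-dd'T'HH:mm:ss string usable alongside the time_zone param.
def absolute_from(delta):
    return (datetime.utcnow() - delta).strftime("%Y-%m-%dT%H:%M:%S")

gte_1m = absolute_from(timedelta(minutes=1))    # e.g. "2016-11-03T07:41:30"
gte_15s = absolute_from(timedelta(seconds=15))
print(gte_1m, gte_15s)
```

The resulting string would go into the "gte" field of the range filter shown above.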

ChintanShah25
  • Need help with "size": 0. I tried looking it up online but could not quite get it. What does it do? – HaggarTheHorrible Nov 03 '16 at 06:42
  • "size": 0 will return [only aggregation results](https://www.elastic.co/guide/en/elasticsearch/reference/current/returning-only-agg-results.html) and omit the search hits. You can remove it if you also want the search hits. – ChintanShah25 Nov 03 '16 at 13:09