I am trying to search an index using a DSL query. I have many documents that match both the log criteria and the timestamp range.
I am passing dates and converting them to epoch milliseconds.
I am also specifying the size parameter in the DSL query.
What I see is that if I specify 5000, it extracts 5000 records in the time range, but there are more records in that range.
How can I retrieve all data matching the time range without having to specify the size?

My DSL query is below.

GET localhost:9200/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "log": "SOME_VALUE" } },
        { "range": {
            "@timestamp": {
              "gte": "'"${fromDate}"'",
              "lte": "'"${toDate}"'",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  },
  "size": 5000
}

fromDate = 1519842600000
toDate = 1520533800000

nirmalraj17
  • Possibly a duplicate of https://stackoverflow.com/questions/8829468/elasticsearch-query-to-return-all-records – IanGabes Mar 09 '18 at 20:20
  • No, this is not a duplicate; I have already seen that. It covers how to retrieve all data, but it does not have conditions or additional parameters, not even the size parameter. – nirmalraj17 Mar 09 '18 at 20:41
  • You should look at the scan and scroll API to achieve what you want, as the answer in the above question indicates. – IanGabes Mar 09 '18 at 20:51
  • Scroll requires specifying how long to keep the context alive. I tried that, but it gives only 5 hits if the size is not specified. That does not solve the problem either. – nirmalraj17 Mar 09 '18 at 20:58

1 Answer


I couldn't get the scan API or scroll pattern working, as it also did not show the expected result.

I finally figured out a way to capture the number of hits and then pass that count as a parameter to extract the data.

GET localhost:9200/_count
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "log": "SOME_VALUE" } },
        { "range": {
            "@timestamp": {
              "gte": "'"${fromDate}"'",
              "lte": "'"${toDate}"'",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}' > count_size.txt
size_count=`cat count_size.txt  | cut -d "," -f1 | cut -d ":" -f2`
echo "Total hits matching this criteria is ${size_count}"
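The cut pipeline above assumes "count" is the first field of the _count response, which is its default position but fragile if the response shape changes. A slightly more robust sketch (the sample response below is hypothetical; in the real script it comes from the _count call) pulls the number out by field name instead:

```shell
# Hypothetical _count response; the real one is read from count_size.txt.
response='{"count":12345,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}'

# Extract the numeric value of "count" regardless of where the field sits.
size_count=$(printf '%s' "$response" | sed -n 's/.*"count":\([0-9]*\).*/\1/p')
echo "Total hits matching this criteria is ${size_count}"
```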

From this I get the size_count value. If it is less than 10000, I extract the data directly; otherwise I reduce the time range for extraction.
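That decision can be sketched as below. The size_count value is hard-coded here for illustration (in the real script it comes from the _count step), and 10000 matches Elasticsearch's default index.max_result_window limit:

```shell
# Decide whether to extract directly or split the time range.
size_count=12500          # would come from the _count step above
fromDate=1519842600000
toDate=1520533800000

if [ "$size_count" -lt 10000 ]; then
    echo "extract directly with size=${size_count}"
else
    # Halve the range and process each half separately.
    midDate=$(( (fromDate + toDate) / 2 ))
    echo "split into [${fromDate},${midDate}] and [${midDate},${toDate}]"
fi
```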

GET localhost:9200/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "log": "SOME_VALUE" } },
        { "range": {
            "@timestamp": {
              "gte": "'"${fromDate}"'",
              "lte": "'"${toDate}"'",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  },
  "size": '"${size_count}"'
}

If a large set of data is required for an extensive period, I run this with different sets of dates and combine the results to get the overall report.
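Splitting a long period into fixed windows can be sketched like this; the echo is a placeholder for the _search call shown above, and the day-sized step is an assumption for illustration:

```shell
# Walk a long period in day-sized windows, extracting each one.
fromDate=1519842600000
toDate=1520533800000
step=86400000   # one day in epoch milliseconds

start=$fromDate
windows=0
while [ "$start" -lt "$toDate" ]; do
    end=$(( start + step ))
    if [ "$end" -gt "$toDate" ]; then
        end=$toDate
    fi
    echo "extracting window [${start},${end}]"   # placeholder for the _search call
    start=$end
    windows=$(( windows + 1 ))
done
```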

This complete piece of code is written as a shell script, so it is much simpler for me to use.

nirmalraj17