I run aggregation that on 2 indices: idx-2020-07-21, idx-2020-07-22 The target: Get all documents, but in the case of duplicate id (50% are), get the one from the latest index using the index name.
This is the query I'm running
{
"size": 0,
"aggregations": {
"latest_item": {
"composite": {
"size": 1000,
"sources": [
{
"product": {
"terms": {
"field": "_id",
"missing_bucket": false,
"order": "asc"
}
}
}
]
},
"aggregations": {
"max_date": {
"top_hits": {
"from": 0,
"size": 1,
"version": false,
"explain": false,
"sort": [
{
"_index": {
"order": "desc"
}
}
]
}
}
}
}
}
}
Each index size is 8G with ~1M docs. ES version 7.5
and it takes around 8Min to aggregate, most of the times I get
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [32933676058/30.6gb], which is larger than the limit of [32641751449/30.3gb].
- Is there a better way to write this query?
- How do I deal with this exception?
- I run a java job that query ES every 10 min, I noticed it happened a lot in the second time, do I need to release any resources or something? I use restHighLevelClient.searchAsync() with a listener that call again with the next key until I get null.
The cluster has 3 nodes, 32G each.
I tries to play with the bucket size it didn't help a lot.
Thanks!