I have a large collection of ~20M-30M documents that are indexed, but my aggregation performance is not keeping up with the growing size. (Using DocumentDB, which is compatible with ~MongoDB 4.0.)
Basic schema (there are a number of other fields, but these are the relevant ones) is
{
"groupId": "non-unique group id", // several can have the same one
"mainCategory": "mainCategory",
"subCategory": "subCategory",
"status": "status", // like success or failure for example
"dateField": "2023-01-01T00:00:00Z"
}
My compound index (I have other specific ones for groupId) is
{
"mainCategory": 1,
"subCategory": 1,
"dateField": -1
}
Aggregation explanation (using an example use case): for each groupId, I want to find the latest record in a given main/sub category pair, but only return it if that latest record has the status I'm searching for (e.g. "failure").
Example Data:
{
"groupId": "1",
"mainCategory": "breakfast",
"subCategory": "eggs",
"status": "success",
"dateField": "2023-01-20T00:00:00Z"
},
{
"groupId": "1",
"mainCategory": "breakfast",
"subCategory": "bacon",
"status": "success",
"dateField": "2023-01-20T00:00:00Z"
},
{
"groupId": "1",
"mainCategory": "breakfast",
"subCategory": "eggs",
"status": "failure",
"dateField": "2023-01-01T00:00:00Z"
}
With the above data, if I search for { mainCategory: "breakfast", subCategory: "eggs", status: "failure" }, I'd expect to get no results back, because the latest breakfast + eggs record (after sorting on the date) is a success.
If I did the same search with a status of "success", I'd expect to get one result back, since the latest breakfast + eggs record in that group is a success.
This is why, in the steps below, the status match has to come after the other steps: I can't filter anything out by status until I know whether it was the latest record or not.
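To make the intended semantics concrete, here is a toy in-memory sketch in plain JavaScript (not the actual query, just the logic it should implement): group by (groupId, subCategory), keep only the newest record per group, and only then filter by status.

```javascript
// Toy in-memory sketch of the intended semantics (not the real pipeline):
// group by (groupId, subCategory), keep the latest record per group,
// and only then filter by status.
function latestWithStatus(docs, mainCategory, subCategory, status) {
  const latest = new Map();
  for (const doc of docs) {
    if (doc.mainCategory !== mainCategory || doc.subCategory !== subCategory) continue;
    const key = `${doc.groupId}|${doc.subCategory}`;
    const prev = latest.get(key);
    // Keep the newest record per (groupId, subCategory) pair.
    if (!prev || doc.dateField > prev.dateField) latest.set(key, doc);
  }
  // Status is checked last, against the latest record only.
  return [...latest.values()].filter((doc) => doc.status === status);
}

const docs = [
  { groupId: "1", mainCategory: "breakfast", subCategory: "eggs", status: "success", dateField: "2023-01-20T00:00:00Z" },
  { groupId: "1", mainCategory: "breakfast", subCategory: "bacon", status: "success", dateField: "2023-01-20T00:00:00Z" },
  { groupId: "1", mainCategory: "breakfast", subCategory: "eggs", status: "failure", dateField: "2023-01-01T00:00:00Z" },
];

console.log(latestWithStatus(docs, "breakfast", "eggs", "failure").length); // 0
console.log(latestWithStatus(docs, "breakfast", "eggs", "success").length); // 1
```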
My aggregation steps are as follows:

1. Match by `mainCategory` and `subCategory`
2. Sort descending by `dateField` (allowing `$first` to work in step 4)
3. Limit to 100k (to try to reduce the performance load and only keep the newest records)
4. Group by `groupId` and `subCategory`, keeping only the first record of each group. The exact grouping is:
   {
     _id: { groupId: "$groupId", subCategory: "$subCategory" },
     "latestRecord": { "$first": "$$ROOT" }
   }
5. Match by `status`
6. Limit to 100
7. replaceRoot with `{ newRoot: "$latestRecord" }`
8. Projection to reduce the data coming back
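Spelled out, the pipeline looks roughly like this (a sketch of the steps above, not my exact code; the collection name `records`, the category/status values, and the projected fields are placeholders from the example):

```javascript
// Sketch of the pipeline described in the steps above.
// Collection name and concrete match/projection values are assumptions.
const pipeline = [
  { $match: { mainCategory: "breakfast", subCategory: "eggs" } }, // step 1
  { $sort: { dateField: -1 } },                                   // step 2: needed so $first sees the newest doc
  { $limit: 100000 },                                             // step 3
  { $group: {                                                     // step 4
      _id: { groupId: "$groupId", subCategory: "$subCategory" },
      latestRecord: { $first: "$$ROOT" },
  } },
  { $match: { "latestRecord.status": "failure" } },               // step 5: status checked only on the latest doc
  { $limit: 100 },                                                // step 6
  { $replaceRoot: { newRoot: "$latestRecord" } },                 // step 7
  { $project: { groupId: 1, status: 1, dateField: 1 } },          // step 8: example field list
];
// db.records.aggregate(pipeline)
```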
It seems like step #2 (the sort) is where it times out / consumes the most time. I read some other posts on similar questions:

- mongodb, compound indexes, sorting
- Indexing MongoDB for quicker find() with sort(), on different fields

But they didn't lead me to a solution. The explain tool shows it's using my index, but that's about all I can get out of it.
As far as I can tell, the aggregation goes much faster if I remove step #2, even though it sorts on an indexed date field. But that step is required for `$first` to work, so I can't remove it.
Any advice would be appreciated!
Explain printout on a smaller database with 2M-3M records:
{
"executionStats" : {
"executionSuccess" : true,
"executionTimeMillis" : "663941.077",
"planningTimeMillis" : "0.530",
"executionStages" : {
"stage" : "SUBSCAN",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663921.912",
"inputStage" : {
"stage" : "LIMIT_SKIP",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663914.564",
"inputStage" : {
"stage" : "SUBSCAN",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663914.535",
"inputStage" : {
"stage" : "SORT_AGGREGATE",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663914.492",
"inputStage" : {
"stage" : "SORT",
"nReturned" : "8547",
"executionTimeMillisEstimate" : "663655.029",
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : "461546",
"executionTimeMillisEstimate" : "660287.707",
"indexName" : "myIndex", // It is correctly using my index here
"direction" : "forward"
}
}
}
}
}
}
},
"serverInfo" : {
"host" : "my-database",
"port" : 27031.0,
"version" : "4.0.0"
},
Interesting note: I converted my dateField to ISODate() instead of a string (and also queried using ISODate() instead of a string). While the performance didn't really improve much, it changed how the beginning of the aggregation worked. Ignore the exact execution times, as this was a smaller database.
"executionStats" : {
"executionSuccess" : true,
"executionTimeMillis" : "2260.105",
"planningTimeMillis" : "0.672",
"executionStages" : {
"stage" : "SUBSCAN",
"nReturned" : "16",
"executionTimeMillisEstimate" : "2259.361",
"inputStage" : {
"stage" : "LIMIT_SKIP",
"nReturned" : "16",
"executionTimeMillisEstimate" : "2258.256",
"inputStage" : {
"stage" : "SUBSCAN",
"nReturned" : "16",
"executionTimeMillisEstimate" : "2258.251",
"inputStage" : {
"stage" : "SORT_AGGREGATE",
"nReturned" : "16",
"executionTimeMillisEstimate" : "2258.245",
"inputStage" : {
"stage" : "SORT",
"nReturned" : "848",
"executionTimeMillisEstimate" : "2221.134",
"inputStage" : {
"stage" : "SUBSCAN",
"nReturned" : "848",
"executionTimeMillisEstimate" : "2218.770",
"inputStage" : {
"stage" : "LIMIT_SKIP",
"nReturned" : "848",
"executionTimeMillisEstimate" : "2203.387",
"inputStage" : {
"stage" : "SORT",
"nReturned" : "848",
"executionTimeMillisEstimate" : "2203.311",
"sortPattern" : {
"dateField" : -1.0
},
"inputStage" : {
"stage" : "FETCH",
"nReturned" : "848",
"executionTimeMillisEstimate" : "2199.608",
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : "848",
"executionTimeMillisEstimate" : "16.417",
"indexName" : "myIndex"
}
}
}
}
}
}
}
}
}
}
},
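An aside on why the results (and the sort order) stayed the same after the ISODate() conversion: ISO-8601 strings sort lexicographically in exactly the same order as the instants they represent, so the conversion changes the BSON type of the field (and, apparently, the plan), but not the ordering itself. A quick sanity check:

```javascript
// ISO-8601 strings order lexicographically the same way the instants
// they represent order chronologically, so a string-typed dateField
// still sorts correctly; the ISODate() conversion changes the BSON
// type, not the ordering.
const stamps = ["2023-01-20T00:00:00Z", "2023-01-01T00:00:00Z", "2023-01-05T00:00:00Z"];
const byString = [...stamps].sort().reverse();                        // descending, compared as strings
const byDate = [...stamps].sort((a, b) => new Date(b) - new Date(a)); // descending, compared as dates
console.log(JSON.stringify(byString) === JSON.stringify(byDate)); // true
```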