
I have a large collection of ~20m-30m documents that are indexed, but my aggregation performance is not keeping up with the growing size. (I'm using DocumentDB, which is roughly compatible with MongoDB 4.0.)

Basic schema (there are a number of other fields, but these are the relevant ones) is

{
  "groupId": "non-unique group id", // several can have the same one
  "mainCategory": "mainCategory",
  "subCategory": "subCategory",
  "status": "status", // like success or failure for example
  "dateField": "2023-01-01T00:00:00Z"
}

My compound index (I have other specific ones for groupId) is

{
  "mainCategory": 1,
  "subCategory": 1,
  "dateField": -1
}
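For reference, the same compound index can be expressed as a pymongo-style key specification (the collection handle and the index name "myIndex" here are placeholders taken from the explain output below):

```python
# pymongo-style key spec for the compound index above.
# Direction 1 = ascending, -1 = descending.
index_spec = [
    ("mainCategory", 1),
    ("subCategory", 1),
    ("dateField", -1),
]

# With a live connection one would run (not executed here):
# db.my_collection.create_index(index_spec, name="myIndex")
```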

Aggregation explanation (using an example use case): I want to find, for each groupId, the latest record in a given main/sub category pair, and then keep only those whose latest record has a status of "failure".

Example Data:

{
  "groupId": "1",
  "mainCategory": "breakfast",
  "subCategory": "eggs",
  "status": "success",
  "dateField": "2023-01-20T00:00:00Z"
},
{
  "groupId": "1",
  "mainCategory": "breakfast",
  "subCategory": "bacon",
  "status": "success",
  "dateField": "2023-01-20T00:00:00Z"
},
{
  "groupId": "1",
  "mainCategory": "breakfast",
  "subCategory": "eggs",
  "status": "failure",
  "dateField": "2023-01-01T00:00:00Z"
}

With the above data, if I search for { mainCategory: "breakfast", subCategory: "eggs", status: "failure" } I'd expect to get no results back, because the latest result (after sorting on the date) is a success.

If I did the same but searched for a status of 'success' I'd expect to get one result back as the latest breakfast + eggs result in that group is a success.

This is why, in the steps below, I need to do the status match after the other steps, so that I don't filter anything out by status until I know whether it was the latest record or not.
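The intended semantics can be sketched in plain Python over the example data above: match by category, sort newest-first, keep the first record per (groupId, subCategory), and only then filter by status (the function name and structure are illustrative, not the actual aggregation):

```python
docs = [
    {"groupId": "1", "mainCategory": "breakfast", "subCategory": "eggs",
     "status": "success", "dateField": "2023-01-20T00:00:00Z"},
    {"groupId": "1", "mainCategory": "breakfast", "subCategory": "bacon",
     "status": "success", "dateField": "2023-01-20T00:00:00Z"},
    {"groupId": "1", "mainCategory": "breakfast", "subCategory": "eggs",
     "status": "failure", "dateField": "2023-01-01T00:00:00Z"},
]

def latest_with_status(docs, main, sub, status):
    # Match by category first (status is deliberately NOT filtered yet).
    matched = [d for d in docs
               if d["mainCategory"] == main and d["subCategory"] == sub]
    # Sort newest-first; ISO-8601 strings sort correctly as plain strings.
    matched.sort(key=lambda d: d["dateField"], reverse=True)
    # Keep only the first (= newest) record per (groupId, subCategory).
    latest = {}
    for d in matched:
        latest.setdefault((d["groupId"], d["subCategory"]), d)
    # Only now filter by status, so a stale "failure" can't outrank
    # a newer "success".
    return [d for d in latest.values() if d["status"] == status]
```

With this data, filtering for "failure" returns nothing (the newest eggs record succeeded), while filtering for "success" returns exactly one record.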

My aggregation steps are as follows:

  1. Match by mainCategory and subCategory
  2. Sort descending by dateField (allowing $first to work in step 4)
  3. Limit to 100k (to try to reduce the performance load and only keep the newest)
  4. Group by groupId and subCategory and only keep the first record of each

Exact grouping:

{
  _id: { groupId: "$groupId", subCategory: "$subCategory" },
  "latestRecord": { "$first": "$$ROOT" }
}

  5. Match by status
  6. Limit to 100
  7. replaceRoot { newRoot: "$latestRecord" }
  8. Projection to reduce the data coming back
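The steps above can be written out as a pymongo-style pipeline (a sketch, not the exact production pipeline: the category values, the projected fields, and the "latestRecord.status" path for the post-group match are my assumptions, since after $group the status lives under latestRecord):

```python
pipeline = [
    # 1. Match by category (status intentionally excluded here).
    {"$match": {"mainCategory": "breakfast", "subCategory": "eggs"}},
    # 2. Sort newest-first so $first picks the latest record.
    {"$sort": {"dateField": -1}},
    # 3. Cap the number of sorted documents fed into the group.
    {"$limit": 100000},
    # 4. Keep the newest record per (groupId, subCategory).
    {"$group": {
        "_id": {"groupId": "$groupId", "subCategory": "$subCategory"},
        "latestRecord": {"$first": "$$ROOT"},
    }},
    # 5. Only now filter by status, on the grouped record.
    {"$match": {"latestRecord.status": "failure"}},
    # 6. Cap the result size.
    {"$limit": 100},
    # 7. Promote the kept record back to the document root.
    {"$replaceRoot": {"newRoot": "$latestRecord"}},
    # 8. Trim the fields coming back (field list is illustrative).
    {"$project": {"groupId": 1, "subCategory": 1, "status": 1, "dateField": 1}},
]

# With a live connection: results = db.my_collection.aggregate(pipeline)
```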

It seems like step #2 (the sort) is where it times out/consumes the most time. I read some other posts on similar questions:

MongoDB Compound Index

mongodb, compound indexes, sorting

Indexing MongoDB for quicker find() with sort(), on different fields

But they didn't lead me to a solution. The explain output shows it's using my index, but that's about all I can get out of it.

As far as I can tell, it goes much faster if I remove step #2, even though it's sorting on an indexed date field. But that step is required for $first to work, so I can't remove it.

Any advice would be appreciated!

Explain Printout on a smaller database with 2m-3m records:

{
"executionStats" : {
    "executionSuccess" : true,
    "executionTimeMillis" : "663941.077",
    "planningTimeMillis" : "0.530",
    "executionStages" : {
        "stage" : "SUBSCAN",
        "nReturned" : "100",
        "executionTimeMillisEstimate" : "663921.912",
        "inputStage" : {
            "stage" : "LIMIT_SKIP",
            "nReturned" : "100",
            "executionTimeMillisEstimate" : "663914.564",
            "inputStage" : {
                "stage" : "SUBSCAN",
                "nReturned" : "100",
                "executionTimeMillisEstimate" : "663914.535",
                "inputStage" : {
                    "stage" : "SORT_AGGREGATE",
                    "nReturned" : "100",
                    "executionTimeMillisEstimate" : "663914.492",
                    "inputStage" : {
                        "stage" : "SORT",
                        "nReturned" : "8547",
                        "executionTimeMillisEstimate" : "663655.029",
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "nReturned" : "461546",
                            "executionTimeMillisEstimate" : "660287.707",
                            "indexName" : "myIndex", // It is correctly using my index here
                            "direction" : "forward"
                        }
                    }
                }
            }
        }
    }
},
"serverInfo" : {
    "host" : "my-database",
    "port" : 27031.0,
    "version" : "4.0.0"
    }
}

Interesting note: I converted my dateField to be an ISODate() instead of a string (and also queried using ISODate() instead of a string). While the performance didn't really improve much, it changed how the beginning of the aggregation worked. Ignore the exact execution times, as this was a smaller database.

"executionStats" : {
    "executionSuccess" : true,
    "executionTimeMillis" : "2260.105",
    "planningTimeMillis" : "0.672",
    "executionStages" : {
        "stage" : "SUBSCAN",
        "nReturned" : "16",
        "executionTimeMillisEstimate" : "2259.361",
        "inputStage" : {
            "stage" : "LIMIT_SKIP",
            "nReturned" : "16",
            "executionTimeMillisEstimate" : "2258.256",
            "inputStage" : {
                "stage" : "SUBSCAN",
                "nReturned" : "16",
                "executionTimeMillisEstimate" : "2258.251",
                "inputStage" : {
                    "stage" : "SORT_AGGREGATE",
                    "nReturned" : "16",
                    "executionTimeMillisEstimate" : "2258.245",
                    "inputStage" : {
                        "stage" : "SORT",
                        "nReturned" : "848",
                        "executionTimeMillisEstimate" : "2221.134",
                        "inputStage" : {
                            "stage" : "SUBSCAN",
                            "nReturned" : "848",
                            "executionTimeMillisEstimate" : "2218.770",
                            "inputStage" : {
                                "stage" : "LIMIT_SKIP",
                                "nReturned" : "848",
                                "executionTimeMillisEstimate" : "2203.387",
                                "inputStage" : {
                                    "stage" : "SORT",
                                    "nReturned" : "848",
                                    "executionTimeMillisEstimate" : "2203.311",
                                    "sortPattern" : {
                                        "dateField" : -1.0
                                    },
                                    "inputStage" : {
                                        "stage" : "FETCH",
                                        "nReturned" : "848",
                                        "executionTimeMillisEstimate" : "2199.608",
                                        "inputStage" : {
                                            "stage" : "IXSCAN",
                                            "nReturned" : "848",
                                            "executionTimeMillisEstimate" : "16.417",
                                            "indexName" : "myIndex"
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
},
canpan14
  • Can you post the full aggregation and perhaps also the explain output? Based on my current understanding, I'm confused as to why the match on `filterField3` is being done in step 5 _after_ the `$group`ing? That will prevent an index from being used for that and, based on the current index definition, prevent the database from using the index to provide the sort (and potentially doing a more optimal 'distinct scan'). Without further details, I think `filterField3` should be removed from the index definition or the filter should be pulled up into step 1. Also the `$limit` probably needs to go – user20042973 Jan 27 '23 at 16:45
  • I'll update the post with a better description, but basically if I filter by filterField3 earlier it will improperly list the 'latest' entries in that group. filterField3 is basically a lower level status field while the others are categories. If I filter early on only the 'Failures' it will hide the fact that a 'success' actually happened afterwards and therefore the 'latest' set of entries has no failures. – canpan14 Jan 27 '23 at 17:13
  • @user20042973 I added an explain printout to the end of it – canpan14 Jan 27 '23 at 17:48
  • Added the execution stats as well – canpan14 Jan 27 '23 at 17:56
  • Do you consider building and maintaining a new collection that stores only the latest record by your grouping? That would be a faster approach if your use case is a frequent one. – ray Feb 01 '23 at 17:43
  • MongoDB 4.2.18+ applied some major performance improvements for sort + skip type queries. But I see you are using an old version of DocumentDb. Have you tried on MongoDB instead to compare performance? – barrypicker Feb 02 '23 at 01:25
  • @ray interesting idea, but given how fast we intake records, wouldn't it constantly have to find/delete records to support that? I don't know enough about mongo to know how much that would hurt performance. I'm all for the idea though. – canpan14 Feb 02 '23 at 13:29
  • @barrypicker yeah sadly DocumentDB uses a mediocre version of mongo 4.0. I was going to talk about moving to MongoDB Atlas if it turns out this is a true limitation. It feels odd though since people work with much larger dbs than this. And most of the aggregation time is spent on the initial fetch. I've also noticed once the section of the index is loaded into memory that the query uses, the query is basically instant. So if I keep running the same query, it'll start doing it in under a second – canpan14 Feb 02 '23 at 13:31
  • 1
    @canpan14, it doesn't use mongo at all. It only uses mongo-to-some-degree-compatible API. They use completely different backends, storage engines, replication methods. When you dig as deep as search optimisation, you should really rely on documentdb specific features, rather than docs and examples from Mongodb. AWS is pretty clear on this account: https://docs.aws.amazon.com/documentdb/latest/developerguide/user_diagnostics.html#user_diagnostics-query_plan E.g. Mongo doesn't have IXAND stage even in v6. – Alex Blex Feb 02 '23 at 14:37
  • I clearly didn't understand DocumentDB enough. I think I was misled by this page https://docs.aws.amazon.com/documentdb/latest/developerguide/compatibility.html – canpan14 Feb 02 '23 at 14:53
  • Sure, I see where confusion comes from. It's a long story. This is how the other side sees it: https://www.mongodb.com/compare/documentdb-vs-mongodb IIRC at the beginning DocumentDB was essentially a replication and orchestration layer on top of mysql, with client-facing mongodb wire protocol. Unfortunately there is no much documentation on internal implementation or query optimisation. At some point there were rumours to migrate DocumentDB to WiredTiger engine, but after Mongo bought it these plans were obviously scrapped. – Alex Blex Feb 02 '23 at 15:09
  • Always the hard part when trying new products! Do you think what I'm trying to do here surpasses what DocumentDB can do? I'm not against moving to MongoDB Atlas, but I don't want to rush to make that transition unless this is truly a limitation of DocumentDB. – canpan14 Feb 02 '23 at 15:12
  • Good question. I cannot answer it tho. First of all it wouldn't be ethical, secondly it's not a question of a single aggregation but all queries you make to the database, SLAs with vendors, etc. If I had to take responsibility I would definitely run the application against both vendors, do some stress-testing, ideally test how you handle primary switchover on Atlas - something you won't experience with DocumentDb. Near-zero initial investment is one of the very few advantages of cloud DBs, so use it to your benefits. – Alex Blex Feb 02 '23 at 15:26
  • Good advice! It's a new database so luckily I have the advantage of not needing to worry about legacy tie-ins. Thanks again! – canpan14 Feb 02 '23 at 15:40
  • Is `"my index"` the edited index with 3 fields (or the original one from the post that had 4)? – user20042973 Feb 02 '23 at 16:51
  • @user20042973 the edited index. I tried both side by side and it preferred the trimmed down one. – canpan14 Feb 02 '23 at 17:03
  • Right, I would expect it to. That's one of the points that I was trying to make in my original comment. And the explain output in the question is reflective of the plan using the new index? I'm surprised to see a sort I think – user20042973 Feb 02 '23 at 21:09
  • @user20042973 Correct. I actually tried this afternoon converting the date field to be ISODate() instead of a string, and while it didn't really change the performance too much, it did change how the initial part of the aggregation worked, which was odd. I'll paste that information into the bottom of my post. – canpan14 Feb 02 '23 at 23:05
  • How big is myIndex? DocumentDB doesn't have compression, indexes tend to be larger than with MongoDB. It also uses a fixed cache reservation, 75% is reserved for index and data caching, 25% is reserved for the engine. You may find that you need to scale up the instance type. – Mihai A Feb 13 '23 at 10:06
  • @MihaiA about 4gb in 300gb collection. I'm going to raise the instance type up to db.r6g.xlarge from db.r6g.large and see if that helps. – canpan14 Feb 13 '23 at 17:36
