
I have the following DB structure:

Workspaces:

| Field | Key | Index |
| ----- | --- | ----- |
| id | PK | id |
| content | | |

Projects:

| Field | Key | Index |
| ----- | --- | ----- |
| id | PK | id |
| workspace | FK | workspace_1 |
| deleted | | deleted_1 |
| content | | |

Items:

| Field | Key | Index |
| ----- | --- | ----- |
| id | PK | id |
| project | FK | project_1 |
| type | | _type_1 |
| deleted | | deleted_1 |
| content | | |
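
For illustration, documents in these collections look roughly like this (the values are made up, only the shapes match the tables above):

// workspaces
{ _id: 'workspaceId1', content: '...' }

// projects
{ _id: 'projectId1', workspace: 'workspaceId1', deleted: false, content: '...' }

// items
{ _id: 'itemId1', project: 'projectId1', type: 'type1', deleted: false, content: '...' }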

I need to calculate the number of items of each type for each project in a workspace, e.g. expected output:

[
  { _id: 'projectId1', itemType1Count: 100, itemType2Count: 50, itemType3Count: 200 },
  { _id: 'projectId2', itemType1Count: 40, itemType2Count: 100, itemType3Count: 300 },
  ....
]

After a few attempts and some debugging, I've created a query that produces the output I need:

const pipeline = [
    { $match: { workspace: 'workspaceId1' } },
    {
      $lookup: {
        from: 'items',
        let: { id: '$_id' },
        pipeline: [
          {
            $match: {
              $expr: {
                $eq: ['$project', '$$id'],
              },
            },
          },
          // Project only the fields needed by later stages, to reduce memory
          // usage and avoid the `exceeded memory limit for $group` error.
          { $project: { _id: 1, type: 1, deleted: 1 } },
        ],
        as: 'items',
      },
    },
    // Use $unwind here to optimize the aggregation pipeline, see:
    // https://stackoverflow.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
    // Without $unwind we may get a `matching pipeline exceeds maximum document size` error.
    // The error doesn't appear on every request, which makes it strange and hard to debug.
    { $unwind: '$items' },
    { $match: { 'items.deleted': { $eq: false } } },
    {
      $group: {
        _id: '$_id',
        items: { $push: '$items' },
      },
    },
    {
      $project: {
        _id: 1,
        // Note: there are only 3 possible item types, so hardcoding their names is OK.
        itemType1Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type1'] },
            },
          },
        },
        itemType2Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type2'] },
            },
          },
        },
        itemType3Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type3'] },
            },
          },
        },
      },
    },
  ]

const counts = await Project.aggregate(pipeline)

The query works as expected, but it's very slow... With about 1000 items in one workspace, it takes about 8 seconds to complete. Any ideas on how to make it faster are appreciated.

Thanks.

Kuzzy

1 Answer


Assuming your indexes are properly set up so that they contain the "correct" fields, we can still make some tweaks to the query itself.
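
For this query, the relevant indexes would be roughly the following; the compound index on `items` is a suggestion, not part of the original schema:

// Serves the first $match stage on projects.
db.projects.createIndex({ workspace: 1 })

// Suggested compound index so the $lookup sub-pipeline can match
// on project and deleted with a single index scan.
db.items.createIndex({ project: 1, deleted: 1 })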

Approach 1: keeping the existing collection schema

db.projects.aggregate([
  {
    $match: {
      workspace: "workspaceId1"
    }
  },
  {
    $lookup: {
      from: "items",
      let: {id: "$_id"},
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                {$eq: ["$project","$$id"]},
                {$eq: ["$deleted",false]}
              ]
            }
          }
        },
        // Project only the fields needed by later stages, to reduce memory
        // usage and avoid the `exceeded memory limit for $group` error.
        {
          $project: {
            _id: 1,
            type: 1,
            deleted: 1
          }
        }
      ],
      as: "items"
    }
  },
  // Use $unwind here to optimize the aggregation pipeline, see:
  // https://stackoverflow.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
  // Without $unwind we may get a `matching pipeline exceeds maximum document size` error.
  // The error doesn't appear on every request, which makes it strange and hard to debug.
  {
    $unwind: "$items"
  },
  {
    $group: {
      _id: "$_id",
      itemType1Count: {
        $sum: {
          $cond: { if: { $eq: ["$items.type", "type1"] }, then: 1, else: 0 }
        }
      },
      itemType2Count: {
        $sum: {
          $cond: { if: { $eq: ["$items.type", "type2"] }, then: 1, else: 0 }
        }
      },
      itemType3Count: {
        $sum: {
          $cond: { if: { $eq: ["$items.type", "type3"] }, then: 1, else: 0 }
        }
      }
    }
  }
])

There are 2 major changes:

  1. The `items.deleted: false` condition moves into the `$lookup` sub-pipeline, so fewer `items` documents are looked up.
  2. `items: { $push: '$items' }` is skipped; instead, a conditional sum is done in the later `$group` stage.

Here is the Mongo playground for your reference (at least for the correctness of the new query).
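
If it's unclear which stage dominates the runtime, the execution plan can be checked before restructuring anything; a quick sketch in the shell:

// Show per-stage execution statistics for the aggregation
// (`pipeline` is the array defined above).
db.projects.explain("executionStats").aggregate(pipeline)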

Approach 2: if the collection schema can be modified, we can denormalize `projects.workspace` into the `items` collection like this:

{
    "_id": "i1",
    "project": "p1",
    "workspace": "workspaceId1",
    "type": "type1",
    "deleted": false
}

In this way, you can skip the $lookup. A simple $match and $group will suffice.

db.items.aggregate([
  {
    $match: {
      "deleted": false,
      "workspace": "workspaceId1"
    }
  },
  {
    $group: {
      _id: "$project",
      itemType1Count: {
        $sum: {
          $cond: { if: { $eq: ["$type", "type1"] }, then: 1, else: 0 }
        }
      },
      itemType2Count: {
        $sum: {
          $cond: { if: { $eq: ["$type", "type2"] }, then: 1, else: 0 }
        }
      },
      itemType3Count: {
        $sum: {
          $cond: { if: { $eq: ["$type", "type3"] }, then: 1, else: 0 }
        }
      }
    }
  }
])

Here is the Mongo playground with denormalized schema for your reference.
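
A one-time backfill for this denormalization could look roughly like the following sketch (it assumes the field names above; the new `workspace` field must then be kept in sync whenever items are created or a project moves between workspaces):

// One-time backfill: copy each project's workspace onto its items.
db.projects.find({}, { workspace: 1 }).forEach(p => {
  db.items.updateMany(
    { project: p._id },
    { $set: { workspace: p.workspace } }
  )
})

// Suggested index to serve the new $match on workspace + deleted.
db.items.createIndex({ workspace: 1, deleted: 1 })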

ray
  • Thanks for the answer. Approach 1 looks much cleaner but, unfortunately, its execution time is almost the same as my original query's: ~8 seconds on the same data. I should note that the `items` collection is quite big (about 700K records). Maybe that's the best possible time for such an aggregation? – Kuzzy Oct 13 '21 at 09:42
  • Can you identify whether it is the `$lookup` that is slow? If so, you might want to add a compound index to `items` on the `project` and `deleted` fields. That should make the initial `$match` faster. – ray Oct 13 '21 at 09:46
  • With the compound index it works even slower, by about 1 second :( – Kuzzy Oct 13 '21 at 11:55
  • Can you use [explain](https://docs.mongodb.com/manual/reference/method/cursor.explain/) to check your query's execution plan? My answer can only offer intuitive directions for what to try in the code itself; we may need more information, such as index usage, to provide more help. – ray Oct 13 '21 at 12:49
  • Explain shows the execution plan only for the first `$match` stage. And yes, I'm completely sure the slowest part of the pipeline is the `$lookup`; all other stages run very fast. Interestingly, this query runs 2x faster on the same data in the production environment than locally, perhaps because more free RAM is available for the indexes. In any case, I've decided to try the second approach and modify the data structure; it will also help make other report queries faster. Thanks for the help, @ray – Kuzzy Oct 14 '21 at 11:49