
I have the following DB structure:

Workspaces:

| Field | Key | Index |
| ----- | --- | ----- |
| id | PK | id |
| content | | |

Projects:

| Field | Key | Index |
| ----- | --- | ----- |
| id | PK | id |
| workspace | FK | workspace_1 |
| deleted | | deleted_1 |
| content | | |

Items:

| Field | Key | Index |
| ----- | --- | ----- |
| id | PK | id |
| project | FK | project_1 |
| type | | _type_1 |
| deleted | | deleted_1 |
| content | | |
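
For illustration, documents in these collections look roughly like this (the values are made up, only the shapes match the tables above):

// workspaces
{ _id: 'workspaceId1', content: '...' }

// projects
{ _id: 'projectId1', workspace: 'workspaceId1', deleted: false, content: '...' }

// items
{ _id: 'itemId1', project: 'projectId1', type: 'type1', deleted: false, content: '...' }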

I need to calculate the number of items of each type for each project in a workspace, e.g. expected output:

[
  { _id: 'projectId1', itemType1Count: 100, itemType2Count: 50, itemType3Count: 200 },
  { _id: 'projectId2', itemType1Count: 40, itemType2Count: 100, itemType3Count: 300 },
  ....
]

After a few attempts and some debugging, I've created a query that produces the output I need:

const pipeline = [
    { $match: { workspace: 'workspaceId1' } },
    {
      $lookup: {
        from: 'items',
        let: { id: '$_id' },
        pipeline: [
          {
            $match: {
              $expr: {
                $eq: ['$project', '$$id'],
              },
            },
          },
          // Project only the fields needed by later stages, to reduce memory
          // usage and avoid the `exceeded memory limit for $group` error.
          { $project: { _id: 1, type: 1, deleted: 1 } },
        ],
        as: 'items',
      },
    },
    // Use $unwind here to optimize the aggregation pipeline, see:
    // https://stackoverflow.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
    // Without $unwind we may get a `matching pipeline exceeds maximum document size` error.
    // The error doesn't appear on every request, which makes it strange and hard to debug.
    { $unwind: '$items' },
    { $match: { 'items.deleted': { $eq: false } } },
    {
      $group: {
        _id: '$_id',
        items: { $push: '$items' },
      },
    },
    {
      $project: {
        _id: 1,
        // Note: there are only 3 possible item types, so hardcoding their names is OK.
        itemType1Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type1'] },
            },
          },
        },
        itemType2Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type2'] },
            },
          },
        },
        itemType3Count: {
          $size: {
            $filter: {
              input: '$items',
              cond: { $eq: ['$$this.type', 'type3'] },
            },
          },
        },
      },
    },
  ]

const counts = await Project.aggregate(pipeline)

The query works as expected, but it's very slow... With about 1000 items in one workspace, it takes about 8 seconds to complete. Any ideas on how to make it faster are appreciated.

Thanks.

Kuzzy

1 Answer


Assuming your indexes are properly set up so that they contain the "correct" fields, we can still make some tweaks to the query itself.
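
For this query, the relevant indexes would be roughly the following; the compound index on `items` is a suggestion, not part of the original schema:

// Serves the first $match stage on projects.
db.projects.createIndex({ workspace: 1 })

// Suggested compound index so the $lookup sub-pipeline can match
// on project and deleted with a single index scan.
db.items.createIndex({ project: 1, deleted: 1 })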

Approach 1: keeping the existing collection schema

db.projects.aggregate([
  {
    $match: {
      workspace: "workspaceId1"
    }
  },
  {
    $lookup: {
      from: "items",
      let: {id: "$_id"},
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                {$eq: ["$project","$$id"]},
                {$eq: ["$deleted",false]}
              ]
            }
          }
        },
        // Project only the fields needed by later stages, to reduce memory
        // usage and avoid the `exceeded memory limit for $group` error.
        {
          $project: {
            _id: 1,
            type: 1,
            deleted: 1
          }
        }
      ],
      as: "items"
    }
  },
  // Use $unwind here to optimize the aggregation pipeline, see:
  // https://stackoverflow.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
  // Without $unwind we may get a `matching pipeline exceeds maximum document size` error.
  // The error doesn't appear on every request, which makes it strange and hard to debug.
  {
    $unwind: "$items"
  },
  {
    $group: {
      _id: "$_id",
      itemType1Count: {
        $sum: {
          $cond: { if: { $eq: ["$items.type", "type1"] }, then: 1, else: 0 }
        }
      },
      itemType2Count: {
        $sum: {
          $cond: { if: { $eq: ["$items.type", "type2"] }, then: 1, else: 0 }
        }
      },
      itemType3Count: {
        $sum: {
          $cond: { if: { $eq: ["$items.type", "type3"] }, then: 1, else: 0 }
        }
      }
    }
  }
])

There are 2 major changes:

  1. The `items.deleted: false` condition moves into the `$lookup` sub-pipeline, so fewer `items` documents are looked up.
  2. `items: { $push: '$items' }` is skipped; instead, a conditional sum is done in the later `$group` stage.

Here is the Mongo playground for your reference (at least for the correctness of the new query).
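
If it's unclear which stage dominates the runtime, the execution plan can be checked before restructuring anything; a quick sketch in the shell:

// Show per-stage execution statistics for the aggregation
// (`pipeline` is the array defined above).
db.projects.explain("executionStats").aggregate(pipeline)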

Approach 2: if the collection schema can be modified, we can denormalize `projects.workspace` into the `items` collection like this:

{
    "_id": "i1",
    "project": "p1",
    "workspace": "workspaceId1",
    "type": "type1",
    "deleted": false
}

In this way, you can skip the $lookup. A simple $match and $group will suffice.

db.items.aggregate([
  {
    $match: {
      "deleted": false,
      "workspace": "workspaceId1"
    }
  },
  {
    $group: {
      _id: "$project",
      itemType1Count: {
        $sum: {
          $cond: { if: { $eq: ["$type", "type1"] }, then: 1, else: 0 }
        }
      },
      itemType2Count: {
        $sum: {
          $cond: { if: { $eq: ["$type", "type2"] }, then: 1, else: 0 }
        }
      },
      itemType3Count: {
        $sum: {
          $cond: { if: { $eq: ["$type", "type3"] }, then: 1, else: 0 }
        }
      }
    }
  }
])

Here is the Mongo playground with denormalized schema for your reference.
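
A one-time backfill for this denormalization could look roughly like the following sketch (it assumes the field names above; the new `workspace` field must then be kept in sync whenever items are created or a project moves between workspaces):

// One-time backfill: copy each project's workspace onto its items.
db.projects.find({}, { workspace: 1 }).forEach(p => {
  db.items.updateMany(
    { project: p._id },
    { $set: { workspace: p.workspace } }
  )
})

// Suggested index to serve the new $match on workspace + deleted.
db.items.createIndex({ workspace: 1, deleted: 1 })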

ray
  • Thanks for the answer. Approach 1 looks much cleaner but, unfortunately, its execution time is almost the same as my original query's: ~8 seconds on the same data. I should note that the `items` collection is quite big (about 700K records). Maybe that's the best possible time for such an aggregation? – Kuzzy Oct 13 '21 at 09:42
  • Can you identify whether it is the `$lookup` that is slow? If so, you might want to add a compound index to `items` on the `project` and `deleted` fields. That should make the initial `$match` faster. – ray Oct 13 '21 at 09:46
  • With the compound index it works even slower, by about 1 second :( – Kuzzy Oct 13 '21 at 11:55
  • Can you use [explain](https://docs.mongodb.com/manual/reference/method/cursor.explain/) to check your query's execution plan? My answer can only offer intuitive directions for what to try in the code itself; we may need more information, such as index usage, to provide more help. – ray Oct 13 '21 at 12:49
  • Explain shows the execution plan only for the first `$match` stage. And yes, I'm completely sure the slowest part of the pipeline is the `$lookup`; all other stages run very fast. Interestingly, this query runs 2x faster on the same data in the production environment than locally, perhaps because more free RAM is available for the indexes. In any case, I've decided to try the second approach and modify the data structure; it will also help make other report queries faster. Thanks for the help, @ray – Kuzzy Oct 14 '21 at 11:49