2

I am completely new to MongoDB and wanted to compare query performance of a NoSQL data model relative to its relational database counter part. I wrote this into MongoDB shell

// Make 10 businesses
// Each business has 10 locations
// Each location has 10 departments
// Each department has 10 teams
// Each team has 100 employees
(new Array(10)).fill(0).forEach(_=>
    db.businesses.insert({
        "name":"Business Name",
        "locations":(new Array(10)).fill(0).map(_=>({
            "name":"Office Location",
            "departments":(new Array(10)).fill(0).map(_=>({
                "name":"Department",
                "teams":(new Array(10)).fill(0).map(_=>({
                    "name":"Team Name",
                    "employees":(new Array(100)).fill(0).map(_=>({
                        "age":Math.floor(Math.random()*100)
                    }))
                }))
            }))
        }))
    })
);

Then I attempted the equivalent of MySQL's EXPLAIN SELECT age,name,(and a few other fields) FROM employees WHERE age >= 50 ORDER BY age DESC by writing this statement:

db.businesses.aggregate([
    { $unwind: "$locations" },
    { $unwind: "$locations.departments" },
    { $unwind: "$locations.departments.teams" },
    { $unwind: "$locations.departments.teams.employees" },
    { $project: { _id: 0, age: "$locations.departments.teams.employees.age" } },
    { $match: { "age": { $gte: 50 }} },
    { $sort: {"age" : -1}}
]).explain("executionStats")

The result was:

"errmsg" : "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.",

So I deleted the sort clause and try to get an explain. But the result was:

TypeError: db.businesses.aggregate(...).explain is not a function

So my questions are:

  1. Primarily, I want to know the performance difference of SELECT age FROM employees WHERE age >= 50 ORDER BY age DESC when compared with the MongoDB's aggregate query counter part. Is it more or less the same? Will one be substantially faster or more performant than the other?

  2. Alternatively, how do I fix my MongoDB query so that I can get performance details to compare against my MySQL query counter part?

John
  • 32,403
  • 80
  • 251
  • 422

4 Answers4

2

Employees are single entities; thus, you probably don't want to model age of a team member so deeply in the rich structure of departments and locations and teams. It is perfectly fine to have a separate employees collection and simply do:

db.businesses.aggregate([
{$match: {"age": {$gt: 50} }}
,{$sort: {"age": -1} }
]);

Deep in your businesses collection you can have:

{ teams: [ {name: "T1", employees: [ "E1", "E34" ]} ] }

Alternately, try this:

db.businesses.aggregate([ your pipeline] ,{allowDiskUse:true});

The OP has a setup of 10 biz -> 10 loc -> 10 depts -> 10 teams -> 100 emps. The first 3 unwinds creates a 10000x explosion of data but the last one is 100x beyond that. We can shrink the hit by using $filter:

db.businesses.aggregate([
{ $unwind: "$locations" },
{ $unwind: "$locations.departments" },
{ $unwind: "$locations.departments.teams" },

{$project: {
        XX: {$filter: {
                    input: "$locations.departments.teams.employees",
                    as: "z",
                    cond: {$gte: [ "$$z.age", 50] }
            }}
    }}
,{$unwind: "$XX"}
,{$sort: {"XX.age":-1}}])
John
  • 32,403
  • 80
  • 251
  • 422
Buzz Moschetti
  • 7,057
  • 3
  • 23
  • 33
  • Won't having employees as separate entities bring me back to my original problem of this https://stackoverflow.com/questions/59080541/have-embedded-documents-instead-of-relying-on-foreign-keys ? Which is what spawned a series of 4-5 questions that got me to this current question. – John Nov 28 '19 at 17:11
  • And results of my experiment, the mongodb query took 3 seconds when I used `{allowDiskUse:true}`. The MySQL `SELECT age FROM t_employee WHERE age => 50 ORDER BY age DESC` took 1.2 seconds. My guess is MySQL doesn't enable `allowDiskUse` by default. No indices were applied in either scenario. Is this result expected for this scenario? Others also see MongoDB being 3x slower than MySQL? – John Nov 28 '19 at 17:26
  • Working on a `map` based alternative to bypass the 100X explosion of `employee` at the end.... – Buzz Moschetti Nov 28 '19 at 17:32
  • Thanks, with the filter approach plus the `{allowDiskUse:true}`, the MongoDB query took 2 seconds. So that's good improvement from the 3 seconds! – John Nov 28 '19 at 20:32
1

You better move $match to the first pipeline, because aggregation framework loses index after first pipeline, also i guess you don't need to unwind those arrays.

Okkano
  • 230
  • 2
  • 9
0

i was able to get a result in 1.5 seconds without any indexes by modifying the query like the following:

db.businesses.aggregate([
    {
        $unwind: "$locations"
    },
    {
        $unwind: "$locations.departments"
    },
    {
        $unwind: "$locations.departments.teams"
    },
    {
        $unwind: "$locations.departments.teams.employees"
    },
    {
        $match: {
            "locations.departments.teams.employees.age": {
                $gte: 50
            }
        }
    },
    {
        $project: {
            _id: 0,
            age: "$locations.departments.teams.employees.age"
        }
    },
    {
        $group: {
            _id: "$age"
        }
    },
    {
        $project: {
            _id: 0,
            age: "$_id"
        }
    },
    {
        $sort: {
            "age": - 1
        }
    }
], {
    explain: false
})
Dĵ ΝιΓΞΗΛψΚ
  • 5,068
  • 3
  • 13
  • 26
  • Thanks, the `{$group:{_id:"$age"}}` effectively does the equivalent of `GROUP BY age` right? Which means if my employee has a name field, then I won't accurately get back the name of the employee along with their age (ie. `SELECT name,age FROM employee`)? – John Nov 28 '19 at 16:41
  • When I remove the `{$group:_id:"$age"}`, but leave in the sort filter, my mongo still crashes from not-enough memory. I don't have the same problem with MySQL. No indices were applied in either scenario. – John Nov 28 '19 at 16:43
  • In my MySQL `SELECT name,age FROM employee WHERE age>=50 ORDER by name DESC` also takes about 1 second. No indices were applied. Still can't get the MongoDB equivalent to not crash... – John Nov 28 '19 at 16:51
0

There is another way to address the overall problem, although it is not apples to apples with the OP question. The goal is to find all age >= 50 and sort. Below is an example that "almost" does so and throws in the loc,dept,team as well in case you were wondering how to get that too, but you can take out the lines to get just the emps. Now, this is unsorted -- but an argument can be made that the DB engine isn't going to do any better job of sorting this than the client and all the data has to come over the wire anyway. And the client can use more sophisticated coding tricks to dig thru to the age field and sort it.

c = db.foo.aggregate([
{$project: {XX:
  {$map: {input: "$locations", as:"z", in:
          {$map: {input: "$$z.departments", as:"z2", in:
                  {$map: {input: "$$z2.teams", as:"z3", in:
                          {loc: "$$z.name",  // remove if you want
                           dept: "$$z2.name", // remove if you want
                           team: "$$z3.name",  // remove if you want
                           emps: {$filter: {input: "$$z3.employees",
                                     as: "z4",
                                     cond: {$gt: [ "$$z4.age", 50] }
                                    }}
                          }
                      }}
              }}
      }}
    }}
]);

ages = [];

c.forEach(function(biz) {
    biz['XX'].forEach(function(locs) {
        locs.forEach(function(depts) {
            depts.forEach(function(teams) {
                teams['emps'].forEach(function(emp) {
                    ages.push(emp['age']);
                                    });
                            });
                    });
            });
    });

print( ages.sort(function(a, b){return b-a}) );

99,98,97,96,95,94,92,92,84,81,78,77,76,72,71,67,66,65,65,64,63,62,62,61,59,59,57,57,57,56,55,54,52,51

On a MacBook Pro running MongoDB 4.0, we see the collection as follows:

Collection            Count   AvgSize          Unz  Xz  +Idx     TotIdx  Idx/doc
--------------------  ------- -------- -G--M------  --- ---- ---M------  -------
                 foo       10   2238682     22386820  4.0    0      16384    0

Given the random age between 0 and 100, it is not surprising that every loc/dept/team has age >= 50 and that the total number of bytes returned is about half. Note, however that the total time to set up the agg -- not return all the bytes -- is ~700 millis.

697 millis to agg; 0.697
found 10
tot bytes 11536558
Buzz Moschetti
  • 7,057
  • 3
  • 23
  • 33
  • Also on your machine, does your query performance match that of MySQL? – John Nov 29 '19 at 15:15
  • I only have postgres. :-) And I wouldn't know the "best" way to model SQL to be performant. I could do a 4 way left outer join biz->loc->dept->teams->emps and filter for age >= 50 but I just haven't tried. – Buzz Moschetti Nov 29 '19 at 16:24
  • In my secnario, no joins would be necessary, because i'd only be selecting on fields from the employees table, and there's no need to reference any of the other tables. So you're simply doing a `SELECT name from employees WHERE age >= 50 ORDER by age DESC`. I'm curious how Postgres performs relative to MongoDB query you showed above! – John Nov 29 '19 at 16:36
  • Oh! I thought the goal was to identify emps >= 50 who were on teams (based on a sample nested `SELECT` example in your other post), not just the basic emps >= 50. No? – Buzz Moschetti Nov 29 '19 at 23:21
  • That's correct, i just want the names of emps with age >= 50 sorted in descending order of age. I don't need details of their business, locations, departments or teams. I can't find a MongoDB query that can perform as fast or efficiently as `SELECT name FROM employees WHERE age >= 50 ORDER BY age DESC` – John Nov 30 '19 at 14:45
  • With 1m employees in postgres: `mydb=# explain analyze select age from employees where age >= 50; QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Seq Scan on employees (cost=0.00..17906.00 rows=511333 width=4) (actual time=0.013..128.348 rows=509957 loops=1) Filter: (age >= 50) Rows Removed by Filter: 490043 Planning time: 0.043 ms Execution time: 149.707 ms` – Buzz Moschetti Nov 30 '19 at 15:56
  • That's interesting, seems like postgres is actually faster than mysql. I'll try some tests later on my end to. As for MongoDB, the reason I don't put employees in a separate collection is because of the restriction in this question here: https://stackoverflow.com/questions/59080541/have-embedded-documents-instead-of-relying-on-foreign-keys . At the end of the day, I need to be able to perform BOTH simple selects and more complicated sub queries at the same time, and seems like relational is better at meeting both requirements simultaneously. – John Nov 30 '19 at 22:12