1

I have energy_carrier documents that are nested in a collection energy_ carrier_groups. I reference those energy_carrier documents from another collection tech and would like to resolve the reference with a $lookup aggregation.

=> How can I define a sub query inside $lookup that preprocesses /unwinds the energy carriers before I do the actual join/lookup?

My preferred way would be to specify a path for the from or foreignField option to target the nested documents of the energy_carrier_groups collection:

"from": "energy_carrier_groups.energy_carriers"

or

"from": "energy_carrier_groups"
"foreignField": "energy_carriers._id". 

However, that does not seem to work.

I found that $lookup supports let and pipeline arguments as an alternative to the options localField and foreignField (since Version 3.6) and that might be the way to go.

https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#join-conditions-and-subqueries-on-a-joined-collection

Since version 5.0 its also possible to combine all four options:

https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#correlated-subqueries-using-concise-syntax

(A requirement for mongodb version 5.0 is a CPU that supports AVX and mine doesn't.)

All that options make knots in my head! Could you please tell me how I should formulate the pipeline to resolve my reference from tech to energy carrier? Can't be that difficult, could it?

Code to create some example data:

import bson
from pymongo import MongoClient


def unique_id():
    return bson.objectid.ObjectId()


client = MongoClient(host='localhost', port=27017)
database = client.forecast

referenced_id = unique_id()

# create referenced collection
energy_carrier_groups = database.energy_carrier_groups
energy_carrier_groups.delete_many({})
energy_carrier_groups.insert_many([
    {
        '_id': unique_id(),
        'name': 'fuels',
        'energy_carriers': [
            {
                '_id': referenced_id,
                'name': 'oil'
            },
            {
                '_id': unique_id(),
                'name': 'gas'
            }
        ]
    },
    {
        '_id': unique_id(),
        'name': 'electricity',
        'energy_carriers': [
            {
                '_id': unique_id(),
                'name': 'green electricity'
            },
            {
                '_id': unique_id(),
                'name': 'conventional electricity'
            }
        ]
    },

])

# create referencing collection
tech = database.tech
tech.delete_many({})
tech.insert_many([
    {
        '_id': unique_id(),
        'name': 'qux',
        'energy_carrier': referenced_id
    },

])

Expected result of my aggregation:

{
    '_id': ObjectId('6183de1b5dd889cfcdeaa711'), 
    'name': 'qux', 
    'energy_carrier': {
        '_id': ObjectId('6183de1b5dd889cfcdeaa70b'), 
        'name': 'oil'
    }
}

First trial, using a path to the nested document:

pipeline = [
    {"$match": {"name": 'qux'}},
    {"$lookup": {
       "from": "$energy_carrier_groups.energy_carriers", # <= does not work 
       "localField": "energy_carrier",
       "foreignField": "_id",
       "as": "energy_carrier"
      }
    },
    {"$unwind": "$energy_carrier"},  # transforms lookup result array to a single entry
]
results = referencing.aggregate(pipeline)

for result in results:
    print(result)

print('finished')

Another trial, using let and pipeline instead of localField and foreignField:

pipeline = [
    {"$match": {"name": 'qux'}},
    {"$lookup": {
       "from": "energy_carrier_groups",
       "let": {"tech_energy_carrier_id": "$energy_carrier"},
       "pipeline": [
           {"$unwind": "$energy_carriers"},
           {"$match": {"$expr": {"$eq": ["$$tech_energy_carrier_id", "$energy_carriers._id"]}}}
       ],
       "as": "energy_carrier"  # overrides id field with an array wrapping the resolved reference
      }
    },
    {"$unwind": "$energy_carrier"},  # transforms array to a single entry
]
results = tech.aggregate(pipeline)

for result in results:
    print(result)

print('finished')

Gives some result but resolves the reference with "filtered energy carrier groups" instead of resolving only the energy carrier.

=> What is the recommended way to resolve the referenced energy carrier for tech?

=> If there is a better No-Sql Database then MongoDb for this porpose, please let me know, too

Related:

https://softwarerecs.stackexchange.com/questions/81175/is-there-an-alternative-to-mongodb-that-allows-to-easily-resolve-document-refere

MongoDB $lookup on nested document

$lookup on ObjectId's in an array

https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#join-conditions-and-subqueries-on-a-joined-collection

https://www.stackchief.com/tutorials/%24lookup%20Examples%20%7C%20MongoDB

Stefan
  • 10,010
  • 7
  • 61
  • 117

2 Answers2

1

You can try,

  • let to pass energy_carrier id to pipeline
  • $match to check expression condition using $in operator, is energy_carrier in energy_carriers._id
  • $project to show required fields
  • $filter to iterate loop of energy_carriers array and filter by variable energy_carrier passed in let
  • $first to get the first element from above filtered result
  • $addFields and $first to get the first element from the above lookup result
pipeline = [
  { $match: { name: "qux" } },
  {
    $lookup: {
      from: "energy_carrier_groups",
      let: { energy_carrier: "$energy_carrier" },
      pipeline: [
        {
          $match: { $expr: { $in: ["$$energy_carrier", "$energy_carriers._id"] } }
        },
        {
          $project: {
            _id: 0,
            energy_carriers: {
              $first: {
                $filter: {
                  input: "$energy_carriers",
                  cond: { $eq: ["$$energy_carrier", "$$this._id"] }
                }
              }
            }
          }
        }
      ],
      as: "energy_carrier"
    }
  },
  {
    $addFields: { energy_carrier: { $first: "$energy_carrier.energy_carriers" } }
  }
]

results = tech.aggregate(pipeline)

Playground

turivishal
  • 34,368
  • 7
  • 36
  • 59
  • 1
    Thank you! This yields the expected result. At the same time, it seems to be quite a complex solution for a "simple" task. If that is the recommended solution, mongodb aggregations seem to have a poor API. – Stefan Nov 08 '21 at 07:49
  • If you know a better No-SqL Database for this purpose, please let me know, too: https://softwarerecs.stackexchange.com/questions/81175/is-there-an-alternative-to-mongodb-that-allows-to-easily-resolve-document-refere – Stefan Nov 08 '21 at 08:06
  • Yes, of course, this is a complex solution, but there are no other simple ways to achieve your requirement on this database schema structure. I can not say mongodb has poor API but you have to improve your schema structure, reduce collections/tables, means avoid lookup/join. – turivishal Nov 08 '21 at 08:23
  • Is it ok to duplicate data in a NoSQL database, Suppose you have 50 tables in a SQL database, coming to NoSQL world, you should probably cut them down to 5 or 10 tables, depending on the trade-off between performance and storage, Higher performance requires more storage. But today storage is cheaper and cheaper. So denormalization has the bigger advantage. – turivishal Nov 08 '21 at 08:25
  • Its not only about performance and storage but also maintenance effort of redundant model structures and consistency. I already try to apply references sparingly. Let's say I put everything in a single, huge collection with > 50 nested levels. If I would like to replace an energy carrier at 1000 different locations, the corresponding aggregation query would get even more complex. If I model the energy carrier groups by tags instead of parent document, I would have to ensure that those unique tags are used propertly. Maybe I need to learn more about validation/how to use kind of "enum" values. – Stefan Nov 08 '21 at 08:40
  • Is there a specific reason why MongoDb does not support paths in "from" or "foreignField" properties? That would make the API usage more powerful/efficient? – Stefan Nov 08 '21 at 08:43
  • *why MongoDb does not support paths in "from" or "foreignField" properties?* => because mongodb is no-sql, it can not recognize the reference collection, and can't join nested collection by path. for suggest or improvement or questions you can post the topic in [mongodb developer forum](https://www.mongodb.com/community/forums/). – turivishal Nov 08 '21 at 08:55
  • It seems that the first stage of the lookup pipline (=$match) can be removed without influencing the result. – Stefan Nov 08 '21 at 14:40
  • $unwind will deconstruct the array, for ex: if one document's array has 10 elements then it will generate 10 root documents after unwind, so it is so expensive transaction, it will occupy more memory and more process. – turivishal Nov 08 '21 at 15:56
  • The first $match stage is to filter main documents, and the better option is to filter elements as per condition in any projection stage. – turivishal Nov 08 '21 at 15:57
0

A. Here is another version, based on my initial trial to use lookup and combined with the addFields trick of turivishal to override/correct the resulting property energy_carrier.

pipeline = [
    {"$match": {"name": 'qux'}},
    {"$lookup": {
       "from": "energy_carrier_groups",
       "let": {"energy_carrier_id": "$energy_carrier"},  # executed on tech
       "pipeline": [  # executed on energy_carrier_groups, with the knowledge of let definition
           {"$unwind": "$energy_carriers"},
           {"$match": {"$expr": {"$eq": ["$$energy_carrier_id", "$energy_carriers._id"]}}}
       ],
       "as": "energy_carrier"  # already includes what we want but also extra fields
      }
    },
    {"$addFields": {  # overrides/corrects the result of the previous stage with parts of it
        "energy_carrier": {"$first": "$energy_carrier.energy_carriers"}
    }
  }
]
results = tech.aggregate(pipeline)

B. Alternatives to MongoDb (e.g. RethinkDb) might be better suited for complex queries

Also see https://softwarerecs.stackexchange.com/questions/81175/is-there-an-alternative-to-mongodb-that-allows-to-easily-resolve-document-refere

Stefan
  • 10,010
  • 7
  • 61
  • 117