0

i have 2 collections : words and phrases Each word document has an array of phrases id's. And each phrase can be active or inactive.

For example :

words :
{"word" => "hello", phrases => [1,2]}
{"word" => "table", phrases => [2]}

phrases :
{"id" => 1, "phrase" => "hello world!", "active" => 1}
{"id" => 2, "phrase" => "hello, i have already bought new table", "active" => 0}

I need to get count of active phrases for each word.

In php i do it like this:
1. get all words
2. for each word get count of active phrases with condition ['active' => 1]

Question: How can i get words with active phrases count in one request? I tried to use MapReduce, but i need to make a request for each word to get count of active phrases.

UPD: In my test collection there are 92 000 phrases and 23 000 words.

I have already tested both variant: with php loop for each word in which i get phrases count and aggreagation function in mongo.

But i changed aggregation pipeline in commets below because of phrases_data. It is array, so i can't use $match on it. I use $unwind after $lookup.

[ '$unwind'  =>  '$5'],
    [
        '$lookup' =>  [
        'from' =>  'phrases_926ee3bc9fa72b029e028ec90e282072ea0721d1',
            'localField' =>  '5',
            'foreignField' =>  '0',
            'as' =>  'phrases_data'
        ]
    ],
    [ '$unwind'  =>  '$phrases_data'],
    [ '$match'  =>  [ 'phrases_data.3'  =>  77] ], //phrases_data.3 => 77 it is similar to phrases_data.active => 1
    [ '$group'  =>  
        [
            '_id'  =>  ['word'  =>  '$1', 'id'  =>  '$0'],
            'active_count'  =>  [ '$sum'  =>  1]
        ]
    ],
    [ '$match'  =>  [ 'active_count'  =>  ['$gt' => 0]] ],
    [ '$sort'  =>
        [
            'active_count'  => -1
        ]
    ]


The problem is that $group command take 80% of process time. And it is much slower than php loop. Here is my results for test collection:

1. Php loop (get words-> get phrases count for each word): 10 seconds
2. Aggregation function : 20 seconds
s7vr
  • 73,656
  • 11
  • 106
  • 127

1 Answers1

0
db.words.aggregate([
    { "$unwind" : "$phrases"},
    {
        "$lookup": {
            "from": "phrases",
            "localField": "phrases",
            "foreignField": "id",
            "as": "phrases_data"
        }
    },
    { "$match" : { "phrases_data.active" : 1} },
    { "$group" : {
        "_id" : "$word",
        "active_count" : { $sum : 1 }
        }
    }
]);

You can use above aggregation pipeline :

  1. Unwind the phrases array from words collection documen as separate document
  2. do a lookup(join) in phrases collection using unwinded phrases
  3. filter the phrases and check for active using $match
  4. Finally group phrases by word and count using $sum : 1
Mihir Bhende
  • 8,677
  • 1
  • 30
  • 37