2

I have faced a weird issue while querying one of our collections and aggregating the result out to another collection. I am querying unique users who have done some action and projecting the count of action performed per user in the aggregation query.

var result = db.getCollection('user_actions').aggregate(
[
  {"$match":{"createdAt":{"$gte":1585161000000,"$lt":1585247340000}}},
  {"$group":{"_id":{"accountId":"$user.id"},"count":{"$sum":1}}},
  {"$sort": {"count": -1}},
  {"$project":{"_id":0,"count":"$count","accountId":"$_id.accountId"}}
]
)
db.getCollection('winners').insert(result.toArray())
db.getCollection('winners').find({}).count()

The result of the above query inserts 12643 documents in the newly created collection.

I have also tried the query with allowDiskUse and cursor initial batch size value as well. But the stats of the collection as an output of the aggregation pipeline remained the same and neither am I getting any exceptions.

{
    "ns" : "data.winners",
    "size" : 1121979,
    "count" : 12639,
    "avgObjSize" : 88,
    "storageSize" : 4096,
    "capped" : false,
    "wiredTiger" : {...},
    "nindexes" : 1,
    "totalIndexSize" : 4096,
    "indexSizes" : {
        "_id_" : 4096
    },
    "ok" : 1,
    "operationTime" : Timestamp(1585321431, 41),
    "$gleStats" : {...},
    "lastCommittedOpTime" : Timestamp(1585321431, 41),
    "$configServerState" : {...},
    "$clusterTime" : {...   }
}

That was more of a plot setting, now, when I remove the sort query from the aggregation, it produces 26397 documents as a result. I could pick up some individual accounts from this result and verify that this is the correct result I should have got in the first place(sorted additionally). I have tried looking into MongoDB Limits and Thresholds but couldn't really find a relation to the observation.

Database version

db.version()
4.0.0

Shell details

mongo -version
MongoDB shell version v4.2.1
git version: edf6d45851c0b9ee15548f0f847df141764a317e
allocator: system
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64

Edit: Worth mentioning as discussed with Valijon in comments, using another stage in the aggregation pipeline to count the documents processed:

{"$count": "winners"}

results in the same value 26397 with and without the {"$sort": {"count": -1}} stage.


It doesn't look to me like a sort memory limit behaviour either since that would have thrown an error according to the documentation.

Naman
  • 27,789
  • 26
  • 218
  • 353
  • did you try to use $out stage directly in your aggregation query, instead of manually inserting result? – matthPen Mar 28 '20 at 07:18
  • @matthPen Yes. I did but with the same outcome. As I mentioned, the difference was mainly observed due to the removal of `sort`, I suspect its a processing threshold or limit that is being reached somewhere. – Naman Mar 28 '20 at 07:26
  • Sometimes [.count()](https://docs.mongodb.com/manual/reference/method/db.collection.count/) doesn't perform predict and return results based on the collection’s metadata. Can you add `$count` as last stage for `user_actions` and (without `$count`) after you insert `winners`? It's better use `$out` which is faster (performed directly in MongoDB). – Valijon Mar 28 '20 at 10:35
  • @Valijon It's not a matter of the `.count` producing incorrect results since I could extract the entire collection as a CSV and validate that the `count` reported is actually correct. Related to `$out` I had, in reality, made use of that stage in aggregation as well but it doesn't create a difference. – Naman Mar 28 '20 at 12:47
  • @Naman can share your data so we can reproduce the issue? – Valijon Mar 28 '20 at 12:52
  • @Valijon Not possible for privacy as well as the data size concerns. But if you're looking for any information around the query or operation do let me know. I would update the question with it. – Naman Mar 28 '20 at 13:14
  • Can you share what you get with `$count` for `user_actions` aggregation and after insert `winners` – Valijon Mar 28 '20 at 13:53
  • @Valijon Edited the question, the `$count` stage used returns the correct value with and without `sort` stage. So it tends to process threshold/limitation as I was guessing previously as well. – Naman Mar 28 '20 at 16:34
  • Couldn't it be a problem of the statistics themselves? Why don't you try to `validate` the collection before getting its statistics. – metaphori Mar 30 '20 at 20:43
  • @metaphori Good point. I have used `mongoexport` to dump the collection as a CSV and validated the content as well. Hence confident of the numbers that I've put up in the question. – Naman Mar 31 '20 at 03:58
  • Can you try putting this data into an updated mongo version and repeating the experiment? There were a couple of bugs fixed concerning sorting in pipelines or in general between 4.0.0 and 4.0.17, most notably [this](https://jira.mongodb.org/browse/SERVER-42565) one – als Apr 01 '20 at 20:55
  • @als Thank you for the comment. To test that out in different versions would be a decent amount of effort and I am afraid I wouldn't have that much time in the near future. But I would keep that as a bookmark if there is no solution to the problem found in the meanwhile. The link that you shared states, *...find commands and aggregation pipelines do not guarantee the same sort order when at least one of the documents in a collection is missing at least one of the fields being sorted on.*, which makes me wonder if the query in question is eligible for the bug since I sort on grouped by count. – Naman Apr 02 '20 at 08:31
  • You dont necessarily have to test out every version. I would probably go for the latest 4.0.x release and if that changes nothing the latest overall stable release. You can spin up a new test instance pretty quick with docker and since it doesnt have to do anything more than that one aggregation you can just selectively dump out the collections that you need with [mongodump](https://docs.mongodb.com/manual/reference/program/mongodump/) – als Apr 02 '20 at 09:13
  • @als I too meant a single version upgrade. But if nothing else, I would try that as well on a later stage. The point is there is still no cause that I am aware of about this failure and that is what I was hoping for when I shared the content here. I am aware of the setup but it's not easy or a routine for me to upgrade a Mongo version in production(there are integrations). On the other hand, solve without sorting and in-memory processing has got me done with my job anyway. – Naman Apr 06 '20 at 00:52
  • @Naman Any update about the issue? I'm having exact problem right now. – Murat Colyaran Dec 15 '21 at 08:21

0 Answers0