1

I got the following errors. when I tried to aggregated it by user_id or distinct on user_id

failed: exception: aggregation result exceeds maximum document size failed: exception: distinct too big, 16mb cap

I wonder know how to finish my tasks under very large collection ?

data format

{
    user_id: "Jack",
    SYMPTOM_1: "flu",
    SYMPTOM_2: "cough",
    SYMPTOM_3: "cancer",
    datetime: "20140101",
}

aggregation query

This query is tried to group users and append all the symptoms of medical records to each user

db.medical_records.aggregate([
    {
        "$sort": { "datetime": 1 }
    },
    {
        "$group": {
            "_id": "$user_id",
            "symptom1":{
                "$push": {"symptom": "$SYMPTOM_1" ,"date": "$datetime"}
            },
            "symptom2":{
                "$push": {"symptom": "$SYMPTOM_2" ,"date": "$datetime"}
            },
            "symptom3":{
                "$push": {"symptom": "$SYMPTOM_3" ,"date": "$datetime"}
            },                        
           "first_date": { "$first": "$datetime" },
           "user_id": { "$first": "$user_id" },
           "count": { "$sum": 1 }
        }
    },
    {
        "$project": {
            "user_id": "$user_id",
            "date": "$datetime",
            "symptom1": "$symptom1",
            "symptom2": "$symptom2",
            "symptom3": "$symptom3",
            "count": "$count",
            "_id": 1
        }
    }
],allowDiskUse=true)

Expected output

{u'user_id': u'de96dsdase303c6c6439891c57901183c0e4c',
   u'symptom1': [{u'symptom': u'1479 ', u'date': u'20040910'}],
   u'symptom2': [{u'symptom': u'     ', u'date': u'20040910'}],
   u'symptom3': [{u'symptom': u'     ', u'date': u'20040910'}],
   u'count': 1,
   u'first_date': u'20040910'}
Markus W Mahlberg
  • 19,711
  • 6
  • 65
  • 89
newBike
  • 14,385
  • 29
  • 109
  • 192
  • 1
    can you show what is the result for your aggregation looks like for a small set of documents? – Salvador Dali Apr 09 '15 at 08:17
  • I think that you can try to output the result of an aggregation framework to another collection with http://stackoverflow.com/a/19600746/1090562. I have not tested itt, hence a comment. Let me know whether it yields a resul – Salvador Dali Apr 09 '15 at 09:18
  • I think you should use $limit to reduce the returned document. As in MongoDB you can't write a query which returns a result greater than 16mb. – Shreyance Jain Apr 09 '15 at 09:21
  • @proc: Could you explain why you create an array for each anyway numbered symptom? Wouldn't it be better to have an array of symptoms? Furthermore, the `$project` stage is totally redundant. – Markus W Mahlberg Apr 09 '15 at 10:50
  • @MarkusWMahlberg because that is the original database design, I can not modify it. although it seems redundant. – newBike Apr 14 '15 at 16:04

1 Answers1

-1

It appears like you are trying to use the allowDiskUse option which would likely solve your problem, but unfortunately you seem to have a syntactical error.

When you pass options to an operation, you need to pass these as an object surrounded by { and }.

What you are doing here is assign true to a new global variable allowDiskUse and pass the result of that assign operation to aggregate, which is just the value true.

Try replacing ],allowDiskUse=true) with ], { allowDiskUse:true } )

This allows you to circumvent the 16MB limit per aggregation stage. But keep in mind that it will still be a very slow operation.

Philipp
  • 67,764
  • 9
  • 118
  • 153