I have a bunch of documents in a MongoDB collection, each of which looks like this:

{
"_id" : ObjectId("539f5556e4b032123458ba30"),
"name" : "H0031324836",
"date" : ISODate("2014-04-01T04:00:00Z"),
"dateString" : "2014-04-01",
"elements" : [
    {
        "start_time" : ISODate("2014-04-01T15:00:00Z"),
        "end_time" : ISODate("2014-04-01T16:00:00Z"),
        "duration" : NumberLong(3600000),
        "value" : 0.6968
    },
    {
        "start_time" : ISODate("2014-04-01T16:00:00Z"),
        "end_time" : ISODate("2014-04-01T17:00:00Z"),
        "duration" : NumberLong(3600000),
        "value" : 1.4873
    },
    // ...
]
}

For each of these documents, I want (through the aggregation framework, ideally) to end up with a document like this:

{
"_id" : ObjectId("539f5556e4b032123458ba30"),
"name" : "H0031324836",
"date" : ISODate("2014-04-01T04:00:00Z"),
"dateString" : "2014-04-01",
"duration" : NumberLong(...blahblah...), // sum of all "$duration" fields
"value" : ...blahblah..., // sum of all "$value" fields
}

I'm not seeing a way to vectorize over the `$elements` array and pick out values, though. Perhaps `$unwind` is an option, but that seems pretty inefficient if it truly explodes the array into a stream of documents just so I can implode them again.

The collection is large (around half a billion docs now, will be several billion when the full data is loaded), so I'm hoping to use the aggregation framework and not MapReduce if possible.

I've got MongoDB 2.6.0, on a hash-sharded collection with 8 shards.

Neil Lunn
Ken Williams
  • `$unwind` is the answer: it's the only way you're going to work with `elements[]` via `$group` and `$sum`. If indices are applicable, remember that they're used only up until the first `$unwind`, after that, you're out of luck. – zamnuts Jun 17 '14 at 14:13
  • @zamnuts This is correct. Would you like to write this as a proper answer? – Philipp Jun 17 '14 at 14:14
  • @Philipp, will do, ty, sometimes i'm lazy, just comment, and move on :P – zamnuts Jun 17 '14 at 14:15

1 Answer
`$unwind` is the answer, as stated in the comments: it is exactly how the aggregation framework was intended to handle this kind of per-array summarization.

First, a bit about `$unwind`. The aggregation framework will only use your collection's indexes up to the point where the pipeline first mutates the documents, so be sure to do the bulk of your filtering before that stage; more on this in the SO answer for "Aggregation pipeline and indexes". `$unwind` itself performs well because it is an optimized internal of MongoDB: it runs natively (in C++) inside the aggregation pipeline, so you avoid the degraded performance of interpreted JavaScript (as in MapReduce). The MongoDB team has worked over several releases to keep aggregation fast.
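To make the index caveat concrete, here is a sketch contrasting the two orderings. The index on `name` is an assumption on my part (the question only says the collection is hash-sharded):

```javascript
// Assumed: an index exists on { name: 1 }. Stages that run before the
// first document-mutating stage ($unwind here) can use it.
const good = [
    { $match: { name: "H0031324836" } }, // can use the index
    { $unwind: "$elements" }             // from here on, no index help
];

// Matching after $unwind instead forces a scan over the exploded documents:
const bad = [
    { $unwind: "$elements" },
    { $match: { name: "H0031324836" } }  // cannot use the index
];
```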

Now what would this pipeline actually look like?

db.collection.aggregate([
    { $match: { name: "H0031324836" } }, // limit to just this record (or set of records); uses an index
    { $unwind: "$elements" }, // explode the array into one document per element
    {
        $group: { // regroup the pieces of each record by the `name` field
            _id: "$name",
            duration: { $sum: "$elements.duration" }, // sum elements[n].duration
            value: { $sum: "$elements.value" } // sum elements[n].value
        }
    }
]);
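One gap worth noting: `$group` keeps only `_id` and the accumulated fields, so the `date` and `dateString` from the desired output would be dropped. A sketch (not run against a live cluster) that carries them through with `$first`, followed by the same sums written as plain JavaScript for a single document:

```javascript
// Every unwound piece of a document shares the same date/dateString,
// so $first is safe for carrying them into the grouped result.
const pipeline = [
    { $match: { name: "H0031324836" } },
    { $unwind: "$elements" },
    { $group: {
        _id:        "$name",
        date:       { $first: "$date" },
        dateString: { $first: "$dateString" },
        duration:   { $sum: "$elements.duration" },
        value:      { $sum: "$elements.value" }
    } }
];

// What $unwind + $group compute, expressed in plain JS for one document:
function sumElements(doc) {
    return doc.elements.reduce(
        (acc, e) => ({
            duration: acc.duration + e.duration,
            value:    acc.value + e.value
        }),
        { duration: 0, value: 0 }
    );
}
```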

See the SO answer for "$unwind an object in aggregation framework" for more insight into the pipeline.

Aggregation will definitely take advantage of your 8 shards; from "Aggregation Introduction - Additional Features and Behaviors":

The aggregation pipeline supports operations on sharded collections. See Aggregation Pipeline and Sharded Collections.

zamnuts
  • Thanks @zamnuts. I'll give that a shot and see how it performs in the field. – Ken Williams Jun 17 '14 at 15:13
  • BTW, I should have mentioned that I'm also open to redesign of the original documents - there's probably a tradeoff between this "compressed" schema, an even more compressed schema that has parallel arrays for the 4 `$elements` values, and a fully exploded schema that essentially pre-`$unwind`s the data. – Ken Williams Jun 17 '14 at 15:16
  • @KenWilliams There are some techniques for time-series data in MongoDB if you want to do the pre-`$unwind` schema. Check out [this SE DBA question](http://dba.stackexchange.com/questions/14533/how-should-i-store-time-series-in-mongodb) for a place to start (which links to http://docs.mongodb.org/ecosystem/use-cases/storing-log-data/). – zamnuts Jun 17 '14 at 15:33
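Following up on the pre-`$unwind` schema idea in the comments above: one common pattern is to maintain running totals at write time with `$inc`, so reads need no `$unwind` at all. The `totalDuration`/`totalValue` field names are hypothetical; the shell form of the update appears in comments, and the same logic is expressed as a plain function below so it can be exercised directly:

```javascript
// Hypothetical write-time pre-aggregation. In the shell this would be roughly:
//
//   db.collection.update(
//       { _id: docId },
//       { $push: { elements: elem },
//         $inc:  { totalDuration: elem.duration, totalValue: elem.value } }
//   );
//
// The same update applied to one in-memory document:
function pushElement(doc, elem) {
    doc.elements.push(elem);                                    // $push
    doc.totalDuration = (doc.totalDuration || 0) + elem.duration; // $inc
    doc.totalValue    = (doc.totalValue    || 0) + elem.value;    // $inc
    return doc;
}
```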