
I have some 25k documents (4 GB in raw JSON) that I want to run a few JavaScript operations over to make them more accessible to my end data consumer (R), and I would like to sort of "version control" these changes by adding a new collection for each change. However, I cannot figure out how to map/reduce without the reduce. I want a one-to-one document mapping: I start out with 25,356 documents in collection_1, and I want to end up with 25,356 documents in collection_2.

I can hack it with this:

var reducer = function(key, value_array) {
    // Each key is emitted exactly once, so the single value just passes through.
    // Note that "key" below is a literal field name, not the key parameter,
    // which is why the output documents get the nested value.key structure.
    return {key: value_array[0]};
}

And then call it like:

db.flat_1.mapReduce(mapper, reducer, {keeptemp: true, out: 'flat_2'})

(My mapper only calls emit once, with a string as the first argument and the final document as the second. It's a collection of those second arguments that I really want.)
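For concreteness, the mapper is shaped roughly like this (a sketch only; the key expression and field names are placeholders, not my actual code):

var mapper = function() {
    // One emit per input document: a unique string key, plus the
    // transformed document that should end up in collection_2.
    emit(this.someUniqueString, {
        'finally': [this.oldFieldA],
        thisIsWhatIWanted: [this.oldFieldB]
    });
};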

But that seems awkward, and I don't know why it even works, since what I emit in my mapper is not equivalent to what my reducer returns. Plus, I end up with a document like

{
    "_id": "0xWH4T3V3R", 
    "value": {
        "key": {
            "finally": ["here"],
            "thisIsWhatIWanted": ["Yes!"]
        }
    }
}

which seems unnecessary.

Also, a cursor that performs its own inserts isn't even a tenth as fast as mapReduce. I don't know MongoDB well enough to benchmark it, but I would guess it's about 50x slower. Is there a way to run through a cursor in parallel? I don't care if the documents in my collection_2 are in a different order than those in collection_1.
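For reference, the cursor version I'm comparing against looks roughly like this (a sketch; transform() is a stand-in for my actual per-document logic):

db.flat_1.find().forEach(function(doc) {
    // transform() is a placeholder for the per-document JavaScript work
    db.flat_2.insert(transform(doc));
});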

– chbrown
  • The reason it works is that your emitted value and your reducer's output *are* the same. Since you return value_array[0] from your reducer, it's exactly the value you emitted, unchanged (it just passes through your reducer). – null Aug 30 '10 at 23:47

5 Answers


When using map/reduce you'll always end up with

{ "value" : { <reduced data> } }

In order to remove the value key you'll have to use a finalize function.

Here's the simplest way to copy data from one collection to another:

map = function() { emit(this._id, this); }          // one emit per document, keyed by _id
reduce = function(key, values) { return values[0]; }  // each key has exactly one value
finalize = function(key, value) { db.collection_2.insert(value); }  // write it to the target

Then you run it as normal:

db.collection_1.mapReduce(map, reduce, { finalize: finalize });
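To sanity-check the one-to-one copy afterwards (collection_2 only receives what finalize inserts; mapReduce's own output collection is separate):

db.collection_1.count();
db.collection_2.count(); // should report the same number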
– null

But that seems awkward, and I don't know why it even works, since what I emit in my mapper is not equivalent to what my reducer returns.

They are equivalent. The reduce function takes in an array of T values and should return a single value in the same T format. The format of T is defined by your map function. Your reduce function simply returns the first item in the values array, which will always be of type T. That's why it works :)

You seem to be on the right track. I did some experimenting and it seems you cannot do a db.collection.save() from the map function, but you can do this from the reduce function. Your map function should simply construct the document format you need:

function map() {
  emit(this._id, { _id: this._id, heading: this.title, body: this.content });
}

The map function reuses the ID of the original document. This should prevent any re-reduce steps, since no values will share the same key.

The reduce function can simply return null. But in addition, you can write the value to a separate collection.

function reduce(key, values) {
  // Side effect: save the single transformed document to its own collection.
  db.result.save(values[0]);

  return null;
}

Now db.result should contain the transformed documents, without any additional map-reduce noise you'd have in the temporary collection. I haven't actually tested this on large amounts of data, but this approach should take advantage of the parallelized execution of map-reduce functions.
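For completeness, the invocation would look something like this (a sketch; mr_scratch is a throwaway name, since newer MongoDB versions require an out option even though the real results are written by reduce itself):

db.collection_1.mapReduce(map, reduce, { out: 'mr_scratch' });

// The transformed documents are now in db.result:
db.result.count();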

– Niels van der Rest
  • This way took 523s and ended up with a collection exactly as I wanted it, whereas the hackish way I described in the question takes 319s. It's unfortunate I can't just call `db.coll.mapReduce(myMapperFunc, null, {'out': 'output'})`. I think reduce is able to batch-save/insert a whole set of items; I think the bottleneck here is the `save()` called in every reduce. – chbrown Aug 28 '10 at 18:24
  • @chbrown: Yes, the `save()` is done twice for each document; the standard reduce-save to the temporary collection, and the explicit save to a separate collection. Just curious, is this solution actually faster than using a single cursor? – Niels van der Rest Aug 28 '10 at 19:49
  • Hi all, we have a similar problem handling large data sets. Since array concatenation and returning large documents from reduce weren't working, we followed the approach mentioned above of saving the documents to a separate collection and returning null from reduce. It works fine, but the db hangs when we do any other operation while the mapReduce is running. Is there a better approach for this? – MRK Aug 28 '12 at 11:38

If you have access to the mongo shell, it accepts JavaScript commands, so it's simpler:

map = function(item) {
    // Despite the name, this isn't a map-reduce map function, just a
    // forEach callback that inserts each document into the target collection.
    db.result.insert(item);
};

db.collection.find().forEach(map);
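If you also want to reshape documents on the way over (as in the question), the callback can build the new document inline. A sketch, with placeholder field names borrowed from the answer above:

db.collection_1.find().forEach(function(doc) {
    db.collection_2.insert({
        _id: doc._id,           // keep the one-to-one mapping stable
        heading: doc.title,     // placeholder renames
        body: doc.content
    });
});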

I faced the same situation. I was able to accomplish this via a Mongo query and projection; see Mongo Query.
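Something along these lines (a sketch; the field names in the projection are placeholders, and the second argument to find() limits which fields come back):

db.collection_1.find({}, { title: 1, content: 1 }).forEach(function(doc) {
    db.collection_2.insert(doc);
});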

– CAMPSMITH

Using only a map without a reduce amounts to copying a collection: http://www.mongodb.org/display/DOCS/Developer+FAQ#DeveloperFAQ-HowdoIcopyallobjectsfromonedatabasecollectiontoanother%3F

– TTT