
I have an existing collection with close to 1 million documents, and now I'd like to append new field data to this collection. (I'm using PyMongo.)

For example, my existing collection db.actions looks like:

...
{'_id':12345, 'A': 'apple', 'B': 'milk'}
{'_id':12346, 'A': 'pear', 'B': 'juice'}
...

Now I want to append new field data to this existing collection:

...
{'_id':12345, 'C': 'beef'}
{'_id':12346, 'C': 'chicken'}
...

such that the resulting collection should look like this:

...
{'_id':12345, 'A': 'apple', 'B': 'milk', 'C': 'beef'}
{'_id':12346, 'A': 'pear', 'B': 'juice', 'C': 'chicken'}
...

I know we can do this with update_one in a for loop, e.g.:

for doc in values:
    collection.update_one(
        {'_id': doc['_id']},
        {'$set': {k: doc[k] for k in fields}},
        upsert=True
    )

where values is a list of dictionaries, each containing two items: the _id key-value pair and the new field's key-value pair. fields contains all the new fields I'd like to add.
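
For the example documents above, fields and values would look something like this (shapes inferred from the description; the concrete values are just the ones from the sample docs):

fields = ['C']
values = [
    {'_id': 12345, 'C': 'beef'},
    {'_id': 12346, 'C': 'chicken'},
]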

However, the issue is that I have a million docs to update, and anything with a for loop is way too slow. Is there a way to append this new field faster? Something similar to insert_many, except that it appends to an existing collection?

===============================================

Update1:

So this is what I have for now:

bulk = self.get_collection().initialize_unordered_bulk_op()
for doc in values:
    bulk.find({'_id': doc['_id']}).update_one({'$set': {k: doc[k] for k in fields} })

bulk.execute()

I first wrote a sample dataframe into the db with insert_many; the performance was: Time spent in insert_many: total: 0.0457min. Then I used update_one with the bulk operation to add two extra fields to the collection and got: Time spent: for loop: 0.0283min | execute: 0.0713min | total: 0.0996min
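
(Side note: the initialize_unordered_bulk_op API has since been deprecated in PyMongo in favour of bulk_write. A roughly equivalent sketch using bulk_write with UpdateOne operations, assuming the same values and fields as above, would be:)

from pymongo import UpdateOne

ops = [
    UpdateOne({'_id': doc['_id']}, {'$set': {k: doc[k] for k in fields}})
    for doc in values
]
# ordered=False lets the server apply the operations in any order
result = self.get_collection().bulk_write(ops, ordered=False)
print('matched:', result.matched_count, 'modified:', result.modified_count)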

Update2:

I added an extra column to both the existing collection and the new column data, for the purpose of using a left join to solve this. If you use a left join you can ignore the _id field.

For example, my existing collection db.actions looks like:

...
{'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'}
{'A': 'pear', 'B': 'juice', 'dateTime': '2017-12-15 06:10:50'}
{'A': 'orange', 'B': 'pop', 'dateTime': '2017-12-15 16:09:10'}
...

Now I want to append new field data to this existing collection:

...
{'C': 'beef', 'dateTime': '2017-10-12 09:08:20'}
{'C': 'chicken', 'dateTime': '2017-12-15 22:40:00'}
...

such that the resulting collection should look like this:

...
{'A': 'apple', 'B': 'milk', 'C': 'beef', 'dateTime': '2017-10-12'}
{'A': 'pear', 'B': 'juice', 'C': 'chicken', 'dateTime': '2017-12-15'}
{'A': 'orange', 'B': 'pop', 'C': 'chicken', 'dateTime': '2017-12-15'}
...
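
(To make the intended join explicit: the two collections are matched on the date portion of dateTime, i.e. its first 10 characters. A minimal illustration, using the hypothetical sample documents above:)

def join_key(doc):
    # date portion of dateTime, used to match rows across the two collections
    return doc['dateTime'][:10]

existing = {'A': 'apple', 'B': 'milk', 'dateTime': '2017-10-12 15:20:00'}
new_row = {'C': 'beef', 'dateTime': '2017-10-12 09:08:20'}
assert join_key(existing) == join_key(new_row) == '2017-10-12'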
Sam
  • see `db.collection.initializeUnorderedBulkOp()` https://docs.mongodb.com/manual/reference/method/db.collection.initializeUnorderedBulkOp/ – Saravana Feb 05 '18 at 18:23
  • @Saravana For that I think I still need to do a for loop to create the `bulk.update`s and then run the `execute`. I had compared using `bulk` and `execute` with `insert`, but it's still much slower than `insert_many`; I'm hoping to get similar performance to `insert_many`. – Sam Feb 05 '18 at 18:32
  • Can't you simply create a pipeline stage for an aggregate with an `$addFields` stage, and run that with something like `db.mydb.values.aggregate(pipeline)`? – Michaël van der Haven Feb 05 '18 at 20:16
  • @MichaëlvanderHaven I will try to look into that (I'm kind of new to MongoDB, still getting familiar with its commands etc), – Sam Feb 05 '18 at 20:33
  • For setting up pipelines in the aggregation framework calls with Python: http://api.mongodb.com/python/current/examples/aggregation.html For the `$addFields` stage, look here: https://docs.mongodb.com/manual/reference/operator/aggregation/addFields/ – Michaël van der Haven Feb 05 '18 at 20:36
  • Are the new column fields in a separate collection, and do they have a relationship with the actual collection? – Saravana Feb 06 '18 at 02:55
  • @Saravana It is in a separate collection. And it indeed has a relationship with the existing collection. I was thinking about using `$lookup` for a left join (if that's where you are going), but the key to join on is not straightforward to me: the key is a timestamp but I'd like to join the two collections by date, and as you may have seen, I'm not familiar with mongodb operations yet, so I stopped there. Plus, appending extra column data is more generic than joining. – Sam Feb 06 '18 at 03:11
  • oh, can you update your sample documents with the date and timestamp fields? – Saravana Feb 06 '18 at 03:13
  • @Saravana Sure thing, so now if you want to use a left join on the datetime, you can just ignore the `_id` field. I wonder how much faster using a left join is compared to a bulk update. – Sam Feb 06 '18 at 03:25
  • How big is the other collection which has the new fields? – Saravana Feb 06 '18 at 03:32
  • @Saravana The existing collection is still the same size, pretty large with ~1 million docs. The new one is pretty small, less than 1k. – Sam Feb 06 '18 at 03:34

2 Answers


If your updates are really unique per document, there is nothing faster than the bulk write API. Neither MongoDB nor the driver can guess what you want to update, so you will need to loop through your update definitions and then batch your bulk changes, which is pretty much described here:

Bulk update in Pymongo using multiple ObjectId

The "unordered" bulk writes can be slightly faster (although in my tests they weren't) but I'd still vote for the ordered approach for error handling reasons mainly).

If, however, you can group your changes into specific recurring patterns, then you're certainly better off defining a bunch of update queries (effectively one update per unique value in your dictionary) and then issuing each of those against a number of documents at once. My Python is too poor at this point to write that entire code for you, but here's a pseudocode example of what I mean:

Let's say you've got the following update dictionary:

{
    "key": "doc1",
    "value": {
        "field1": "value1",
        "field2": "value2"
    }
},
{
    "key": "doc2",
    "value": {
        // same fields and values again as for "doc1"
        "field1": "value1",
        "field2": "value2"
    }
},
{
    "key": "doc3",
    "value": {
        "someotherfield": "someothervalue"
    }
}

then instead of updating the three documents separately, you would send one update covering the first two documents (since they require identical changes) and then one update for "doc3". The more knowledge you have upfront about the structure of your update patterns, the more you can optimize this, even by grouping updates of subsets of fields, but that probably gets a little complicated at some point...

UPDATE:

As per your request below, let's give it a shot.

fields = ['C']
values = [
    {'_id': 'doc1a', 'C': 'v1'},
    {'_id': 'doc1b', 'C': 'v1'},
    {'_id': 'doc2a', 'C': 'v2'},
    {'_id': 'doc2b', 'C': 'v2'}
]

print('before transformation:')
for doc in values:
    print('_id ' + doc['_id'])
    for k in fields:
        print(doc[k])

# group the _ids by the value of 'C' they should receive
transposed_values = {}
for doc in values:
    transposed_values.setdefault(doc['C'], []).append(doc['_id'])

print('after transformation:')
for k, v in transposed_values.items():
    print(k, v)

# one update_many per distinct value instead of one update_one per document
for k, v in transposed_values.items():
    collection.update_many({'_id': {'$in': v}}, {'$set': {'C': k}})
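
If you'd rather send the grouped updates in a single round trip, they can also be wrapped in one bulk_write call (a sketch, assuming PyMongo 3.x and the same transposed_values as above):

from pymongo import UpdateMany

ops = [
    UpdateMany({'_id': {'$in': ids}}, {'$set': {'C': c_value}})
    for c_value, ids in transposed_values.items()
]
collection.bulk_write(ops, ordered=False)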
dnickless
  • I see, that's good to know. The values of the new fields I'm trying to update are not entirely unique, e.g. the field name is the same (always 'C'), and the values only fall into a few categories, i.e. 'v1', 'v2' and 'v3'. Could you show pseudocode in mongo commands? It seems like I don't have to split up the bulk operations as in the link you posted. – Sam Feb 05 '18 at 20:48
  • @Sam: Check out the updated answer. I am not sure my data structures match the ones you have but it should point you in the right direction. – dnickless Feb 05 '18 at 22:01

Since your join collection has fewer documents, you can convert the dateTime to a date:

db.new.find().forEach(function(d){
    d.date = d.dateTime.substring(0,10);
    db.new.update({_id : d._id}, d);
})
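
(If you prefer to stay in PyMongo for this conversion step, a rough equivalent sketch — the connection details and the collection name new are assumptions — would be:)

from pymongo import MongoClient, UpdateOne

new_coll = MongoClient()['test']['new']   # assumed database/collection names

ops = [
    UpdateOne({'_id': d['_id']}, {'$set': {'date': d['dateTime'][:10]}})
    for d in new_coll.find({}, {'dateTime': 1})
]
if ops:
    new_coll.bulk_write(ops, ordered=False)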

Then do a multi-field $lookup based on the date (the substring of dateTime) and _id,

and $out to a new collection (enhanced):

db.old.aggregate(
    [
        {$lookup: {
                from : "new",
                let : {id : "$_id", date : {$substr : ["$dateTime", 0, 10]}},
                pipeline : [
                    {$match : {
                        $expr : {
                            $and : [
                                {$eq : ["$$id", "$_id"]},
                                {$eq : ["$$date", "$date"]}
                            ]
                        }
                    }},
                    {$project : {_id : 0, C : "$C"}}
                ],
                as : "newFields"
            }
        },
        {$project : {
            _id : 1,
            A : 1,
            B : 1,
            C : {$arrayElemAt : ["$newFields.C", 0]},
            date : {$substr : ["$dateTime", 0, 10]}
        }},
        {$out : "enhanced"}
    ]
).pretty()

result

> db.enhanced.find()
{ "_id" : 12345, "A" : "apple", "B" : "milk", "C" : "beef", "date" : "2017-10-12" }
{ "_id" : 12346, "A" : "pear", "B" : "juice", "C" : "chicken", "date" : "2017-12-15" }
{ "_id" : 12347, "A" : "orange", "B" : "pop", "date" : "2017-12-15" }
> 
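
The same pipeline can be driven from PyMongo as well; a sketch (MongoDB 3.6+ is required for $lookup with let/pipeline, and the database/collection names here are assumptions):

from pymongo import MongoClient

db = MongoClient()['test']   # assumed connection and database

pipeline = [
    {'$lookup': {
        'from': 'new',
        'let': {'id': '$_id', 'date': {'$substr': ['$dateTime', 0, 10]}},
        'pipeline': [
            {'$match': {'$expr': {'$and': [
                {'$eq': ['$$id', '$_id']},
                {'$eq': ['$$date', '$date']},
            ]}}},
            {'$project': {'_id': 0, 'C': '$C'}},
        ],
        'as': 'newFields',
    }},
    {'$project': {
        '_id': 1, 'A': 1, 'B': 1,
        'C': {'$arrayElemAt': ['$newFields.C', 0]},
        'date': {'$substr': ['$dateTime', 0, 10]},
    }},
    {'$out': 'enhanced'},
]
db['old'].aggregate(pipeline)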
Saravana
  • I was getting this error `"errmsg" : "arguments to $lookup must be strings"`, but after a quick search, this seems to be caused by the [different syntax](https://stackoverflow.com/questions/47669976/mongodb-aggregate-arguments-to-lookup-must-be-strings) of `$lookup` in v3.4 vs v3.6. Then installation of v3.6 seems to stall on Win7, so I'm going to attempt dnickless's solution first. – Sam Feb 06 '18 at 20:29