
I need to ignore duplicate inserts when using `insert_many` with pymongo, where the duplicates are determined by the index. I've seen this question asked on Stack Overflow, but I haven't seen a useful answer.

Here's my code snippet:

import pymongo

try:
    # ordered=False lets the rest of the batch continue past individual failures
    results = mongo_connection[db][collection].insert_many(
        documents, ordered=False, bypass_document_validation=True)
except pymongo.errors.BulkWriteError as e:
    logger.error(e)

I would like the `insert_many` to ignore duplicates and not throw an exception (which fills up my error logs). Alternatively, is there a separate exception handler I could use, so that I can just ignore the errors? I miss `w=0`...

Thanks

  • Even with `ordered=False`, bulk "inserts" still throw errors, even though the rest of the batch actually commits. The choice is up to you: either `try .. except` and essentially "ignore" the duplicate key error, or, if you really don't want to live with that, use "upserts" instead (see the sketch after these comments). That does require what is effectively a "find" on each document, but by nature it "cannot" create a duplicate key. It's just how it works. – Neil Lunn Jun 30 '17 at 04:00
  • How do I ignore the specific "duplicate key" error? I don't want to inadvertently ignore other errors. – vgoklani Jun 30 '17 at 04:03
  • Well, the `BulkWriteError` (or whatever the particular class is in Python; need to look that up) will list each error in an array. Those entries have an error code, which is `E11000` off the top of my head. Simply process and ignore those, and of course really "throw/complain/log/whatever" on any other code present. – Neil Lunn Jun 30 '17 at 04:05
  • This is the error string: "batch op errors occurred" which is not very specific. – vgoklani Jun 30 '17 at 04:07
  • Give me a moment to reproduce one. All APIs should have basically the same thing. The "stringified" form will generally be "just a string", but there is actually more specific info in the object when you inspect it. It's that way for other languages, so I don't see why Python would be any different. – Neil Lunn Jun 30 '17 at 04:10
  • 1
    Dear S.M.Styvane, Yes this question has been asked before, unfortunately none of the answers were satisfactory. Hence the reason for re-posting. But in this case, the answer is correct, and useful. – vgoklani Jul 03 '17 at 05:35
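
For reference, the "upserts" route mentioned in the comments might look like the following sketch (collection and document names are illustrative, not from the question): each document becomes an `UpdateOne` with `upsert=True`, so existing `_id`s are left alone, new ones are inserted, and no duplicate key error is raised either way.

from pymongo import MongoClient, UpdateOne

client = MongoClient()
collection = client.test.duptest

docs = [{'_id': 1}, {'_id': 1}, {'_id': 2}]

# $setOnInsert only applies when the upsert actually inserts,
# so a matching _id leaves the existing document untouched.
ops = [UpdateOne({'_id': doc['_id']}, {'$setOnInsert': doc}, upsert=True)
       for doc in docs]
result = collection.bulk_write(ops, ordered=False)

As Neil notes, this effectively performs a "find" per document, so it trades throughput for never raising duplicate key errors in the first place.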

3 Answers


You can deal with this by inspecting the errors produced with `BulkWriteError`. This is actually an "object" which has several properties; the interesting parts are in `details`:

import pymongo
from pymongo import MongoClient

client = MongoClient()
db = client.test

collection = db.duptest

docs = [{'_id': 1}, {'_id': 1}, {'_id': 2}]

try:
    result = collection.insert_many(docs, ordered=False)
except pymongo.errors.BulkWriteError as e:
    print(e.details['writeErrors'])

On a first run, this will give the list of errors under e.details['writeErrors']:

[
  { 
    'index': 1,
    'code': 11000, 
    'errmsg': 'E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }',
    'op': {'_id': 1}
  }
]

On a second run, you see three errors because all items existed:

[
  {
    "index": 0,
    "code": 11000,
    "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }", 
    "op": {"_id": 1}
   }, 
   {
     "index": 1,
     "code": 11000,
     "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }",
     "op": {"_id": 1}
   },
   {
     "index": 2,
     "code": 11000,
     "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 2 }",
     "op": {"_id": 2}
   }
]

So all you need to do is filter the array for entries with `"code": 11000` and only "panic" when something else is in there:

panic = list(filter(lambda x: x['code'] != 11000, e.details['writeErrors']))

if len(panic) > 0:
    print("really panic")

That gives you a mechanism for ignoring the duplicate key errors while still paying attention to anything that is actually a problem.
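
Wrapped up as a reusable helper, the same pattern might look like the sketch below (the function name `insert_many_ignore_dups` is illustrative, not part of pymongo):

import pymongo

DUPLICATE_KEY_ERROR = 11000

def insert_many_ignore_dups(collection, docs):
    # Insert docs, swallowing duplicate key errors but re-raising
    # the BulkWriteError if any other write error occurred.
    try:
        collection.insert_many(docs, ordered=False)
    except pymongo.errors.BulkWriteError as e:
        panic = [err for err in e.details['writeErrors']
                 if err['code'] != DUPLICATE_KEY_ERROR]
        if panic:
            raise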

  • I did not know about the details field in the exception object, thanks! – vgoklani Jun 30 '17 at 04:49
  • 5
    @vgoklani It's kind of hidden, and not even really documented :( So even I had to go digging for it, even though I "knew it was there somewhere". Hence the delay since my last comments. – Neil Lunn Jun 30 '17 at 04:51

Adding more to Neil's solution.

Passing `ordered=False` allows the remaining pending inserts to proceed even after a duplicate key error (`bypass_document_validation=True` only skips schema validation and does not affect duplicate handling).

from pymongo import MongoClient, errors

DB_CLIENT = MongoClient()
MY_DB = DB_CLIENT['my_db']
TEST_COLL = MY_DB.dup_test_coll

doc_list = [
    {
        "_id": "82aced0eeab2467c93d04a9f72bf91e1",
        "name": "shakeel"
    },
    {
        "_id": "82aced0eeab2467c93d04a9f72bf91e1",  # duplicate error: 11000
        "name": "shakeel"
    },
    {
        "_id": "fab9816677774ca6ab6d86fc7b40dc62",  # this new doc gets inserted
        "name": "abc"
    }
]

try:
    # inserts new documents even on error
    TEST_COLL.insert_many(doc_list, ordered=False, bypass_document_validation=True)
except errors.BulkWriteError as e:
    print(f"Articles bulk insertion error {e}")

    panic_list = list(filter(lambda x: x['code'] != 11000, e.details['writeErrors']))
    if len(panic_list) > 0:
        print(f"these are not duplicate errors {panic_list}")

And since we are talking about duplicates, it's worth checking this solution as well.


The correct solution is to use a `WriteConcern` with `w=0` together with `ordered=False`:

from pymongo.write_concern import WriteConcern

coll = mongodb_connection[db][collection].with_options(write_concern=WriteConcern(w=0))
coll.insert_many(messages, ordered=False)

Note that `w=0` makes the inserts unacknowledged: duplicate key errors no longer raise an exception, but neither does any other write error.