9

How do I bulk update/insert in MongoDB with pymongo/pandas? The error I get is "batch op errors occurred". The reason, I believe, is that I set the "_id" myself, which I want to do. The code runs fine on the first run, but fails on the second. I want to keep pandas in the workflow. The data also contains datetime objects.

The syntax for an update with upsert=True is completely different from an insert. An efficient update-based solution would be helpful, where either "_id" or "qid" could be set as the key. But there are Python datetime objects to deal with!

import json

import pandas as pd

InSQL  = 'SELECT * FROM database2.table2'
sqlOut = pd.read_sql(InSQL, cxn)
sqlOut['_id'] = "20170101" + ":" + sqlOut['Var']

dfOut = sqlOut.to_json(orient='records', date_format='iso')
try:
    db["test"].insert_many(json.loads(dfOut))
except Exception as e:
    print(e)
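
For reference, one variant I'm considering (an untested sketch) skips the JSON round-trip entirely, so the datetime columns stay native Python datetime objects and PyMongo stores them as BSON dates instead of ISO strings:

records = sqlOut.to_dict(orient='records')  # values keep their Python types
try:
    db["test"].insert_many(records)  # datetimes become BSON dates (assumes no NaT values)
except Exception as e:
    print(e)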

I have offered a 50pt bounty, which expired with no answer. Hmm...

Merlin
  • 24,552
  • 41
  • 131
  • 206
  • After `json.loads(dfOut)` your dates are plain strings, not date objects ([check this](http://stackoverflow.com/questions/2803852/python-date-string-to-date-object)); you can't insert a local variable as an entry like that. Consider a primary key of `YYYYmmDDHHMMSS` + a counter value. – dsgdfg Mar 01 '17 at 07:16
  • You can dig into the `BulkWriteError` details and get a better idea of what's going on. It may be that your personally defined ids are duplicated or violate the 12-byte limit. See http://stackoverflow.com/questions/30355790/mongodb-bulk-write-error for details. – conner.xyz Mar 03 '17 at 15:09
  • The fact that the _id worked the first time tells me the 12-byte limit is not the concern. Yes, the _id or qid values are dups and are needed for the update. – Merlin Mar 03 '17 at 17:15
  • It sounds like your goal is to do an upsert but the current sample code is doing an insert (so it is expected that the successive runs will fail with duplicate `_id` keys). Can you update with a short example of the `dfOut` content and the upsert code you tried? Also, what version of PyMongo are you using? I suspect you actually want to use `update_many()` with the `upsert` option set. Can you also clarify the concerns on Python datetime objects and `_id` versus `qid`? Do you want to use the `qid` as the `_id`? – Stennie Apr 25 '17 at 22:18

2 Answers

7

You get an error on the second and subsequent insert_many calls because you try to insert documents whose fields conflict with those of existing documents. You correctly inferred it is due to setting _id explicitly, which then conflicts with the _id values already in the collection.

MongoDB automatically creates a unique index on _id, which forbids duplicate values.
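
A minimal reproduction of that failure (a sketch; the `demo` collection name and `test_db` database are illustrative):

    from pymongo import MongoClient
    from pymongo.errors import BulkWriteError

    db = MongoClient()["test_db"]
    docs = [{"_id": "20170101:A", "x": 1}, {"_id": "20170101:B", "x": 2}]

    db["demo"].insert_many(docs)      # first run: succeeds
    try:
        db["demo"].insert_many(docs)  # second run: same _id values
    except BulkWriteError as bwe:
        # each write error carries code 11000 (duplicate key)
        print(bwe.details["writeErrors"][0]["code"])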

You need to update or replace your documents on calls after the first one (which inserted the documents in their first version). There is indeed a concept of "upsert", which inserts documents that don't yet exist in the collection and updates the ones that do.

Your options:

  • Most efficient: pymongo.collection.Collection.bulk_write

    import json
    import pymongo

    # one ReplaceOne per document, keyed on _id; upsert=True inserts
    # the document when no match exists and replaces it otherwise
    operations = [pymongo.operations.ReplaceOne(
        filter={"_id": doc["_id"]},
        replacement=doc,
        upsert=True,
    ) for doc in json.loads(dfOut)]

    result = db["test"].bulk_write(operations)
    # inspect result.upserted_count / result.modified_count

Note that its efficiency also depends on whether the filter field is indexed in the collection, which is automatically the case for _id. (Also see pymongo.operations.ReplaceOne.)
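
If you would rather key the upsert on `qid` than on `_id`, the same pattern applies, but you should create a unique index on that field yourself first (a sketch; it assumes each document carries a `qid` and that you let MongoDB generate `_id`):

    import json
    import pymongo

    db["test"].create_index("qid", unique=True)  # one-time setup; _id is indexed automatically

    operations = [pymongo.operations.ReplaceOne(
        filter={"qid": doc["qid"]},
        replacement=doc,  # must not carry an _id that differs from the matched document's
        upsert=True,
    ) for doc in json.loads(dfOut)]

    db["test"].bulk_write(operations)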

Note: pymongo.collection.Collection.update_many seems unsuitable for your needs since you are not trying to set the same value on all matches of a given filter.

korrigan
  • 372
  • 5
  • 16
2

The batch op error is most likely caused by duplicate _id values, so one option is to delete the documents with those _ids already in Mongo before inserting, as sketched below.
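
A sketch of that delete-then-insert approach (note it is not atomic, so other readers can observe the window between the delete and the insert):

    docs = json.loads(dfOut)
    ids = [doc["_id"] for doc in docs]

    db["test"].delete_many({"_id": {"$in": ids}})  # drop earlier versions of these documents
    db["test"].insert_many(docs)                   # re-insert the fresh batch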

Or use update_many: https://api.mongodb.com/python/current/api/pymongo/collection.html?highlight=update#pymongo.collection.Collection.update_many

https://docs.mongodb.com/manual/reference/method/db.collection.updateMany/
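
Note that update_many applies one update document to every match of its filter, so for row-specific values the closest fit is one upsert per row via update_one (a sketch; simpler but slower than a single bulk_write):

    for doc in json.loads(dfOut):
        key = doc.pop("_id")  # filter on _id, $set the remaining fields
        db["test"].update_one({"_id": key}, {"$set": doc}, upsert=True)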

Kun
  • 91
  • 1
  • 5