I would like to use PyMongo’s bulk write operation features, which execute write operations in batches in order to reduce the number of network round trips and increase write throughput.
I also found here that it is possible to use 5000 as a batch size.
However, I do not know whether that is the best batch size, or how to combine PyMongo’s bulk write operation features with generators in the following code:
from pymongo import MongoClient
from itertools import groupby
import csv


def iter_something(rows):
    key_names = ['type', 'name', 'sub_name', 'pos', 's_type', 'x_type']
    chr_key_names = ['letter', 'no']
    # consecutive rows sharing the first six columns are merged into one document
    for keys, group in groupby(rows, lambda row: row[:6]):
        result = dict(zip(key_names, keys))
        result['chr'] = [dict(zip(chr_key_names, row[6:])) for row in group]
        yield result


def main():
    converters = [str, str, str, int, int, int, str, int]
    with open("/home/mic/tmp/test.txt") as c:
        reader = csv.reader(c, skipinitialspace=True)
        # apply the matching converter to each column of each row
        converted = ([conv(col) for conv, col in zip(converters, row)] for row in reader)
        for object_ in iter_something(converted):
            print(object_)


if __name__ == '__main__':
    db = MongoClient().test
    sDB = db.snps
    main()
test.txt file:
Test, A, B01, 828288, 1, 7, C, 5
Test, A, B01, 828288, 1, 7, T, 6
Test, A, B01, 171878, 3, 7, C, 5
Test, A, B01, 171878, 3, 7, T, 6
Test, A, B01, 871963, 3, 9, A, 5
Test, A, B01, 871963, 3, 9, G, 6
Test, A, B01, 1932523, 1, 10, T, 4
Test, A, B01, 1932523, 1, 10, A, 5
Test, A, B01, 1932523, 1, 10, X, 6
Test, A, B01, 667214, 1, 14, T, 4
Test, A, B01, 667214, 1, 14, G, 5
Test, A, B01, 67214, 1, 14, G, 6
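
Something like the following is what I have in mind: pull documents from the generator in fixed-size batches with itertools.islice and write each batch in one bulk operation. This is an untested sketch, not a working solution: it assumes a PyMongo version that provides insert_many, the name insert_in_batches and its batch_size parameter are just ones I made up, and 5000 is only the batch size mentioned above, not a measured optimum.

from itertools import islice


def insert_in_batches(collection, documents, batch_size=5000):
    # documents can be any iterable, including a generator such as
    # iter_something(converted); islice takes at most batch_size items
    # at a time without exhausting the rest of the stream
    docs = iter(documents)
    while True:
        batch = list(islice(docs, batch_size))
        if not batch:
            break
        # one insert_many call per batch, so roughly one network
        # round trip per batch_size documents
        collection.insert_many(batch)

# e.g. instead of printing inside main():
# insert_in_batches(sDB, iter_something(converted), batch_size=5000)

Is this the right way to combine the bulk write features with a generator, and is there a rule of thumb for choosing batch_size?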