
I'm having a much more difficult time than I expected importing multiple documents from MongoDB into RAM in batches. I am writing an application that communicates with a MongoDB instance via pymongo; the database currently holds about 2 GB of data but could grow to over 1 TB in the near future. Because of this, reading a limited number of records into RAM at a time is important for scalability.

Based on this post and this documentation I thought this would be about as easy as:

from pymongo import MongoClient

HOST = MongoClient(MONGO_CONN)
DB_CONN = HOST.database_name
collection = DB_CONN.collection_name
cursor = collection.find()
cursor.batch_size(1000)
next_1K_records_in_RAM = cursor.next()  # expected this to return the next 1,000 documents

This isn't working for me, however. Even though my Mongo collection contains more than 200K BSON documents, iterating the cursor returns them one at a time as single dictionaries, e.g. {_id: ID1, ...}, instead of what I'm looking for, which is an array of dictionaries representing multiple documents from my collection, e.g. [{_id: ID1, ...}, {_id: ID2, ...}, ..., {_id: ID1000, ...}].

I wouldn't expect this to matter, but I'm on Python 3.5 instead of 2.7.

As this example references a secure, remote data source, it isn't reproducible. Apologies for that. If you have a suggestion for how the question can be improved, please let me know.


1 Answer

  • The Python version is irrelevant here; it has nothing to do with your output.
  • batch_size only defines how many documents MongoDB returns in a single round trip to the server (subject to some limitations: see here).
  • collection.find() always returns a cursor (an iterator), even when no documents match; batching does its job transparently behind it.
  • To examine the returned documents you have to iterate through the cursor (see the sketch after this list), e.g.

    for document in cursor:
        print(document)

    or, if you want a list of the documents: list(cursor)

    • Remember to do a cursor.rewind() if you need to revisit the documents
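
As a minimal sketch of the points above (reusing the MONGO_CONN, database_name, and collection_name placeholders from the question; an illustration, not a drop-in implementation):

from pymongo import MongoClient

client = MongoClient(MONGO_CONN)                      # MONGO_CONN as in the question
collection = client.database_name.collection_name

# batch_size only tunes how many documents each round trip to the server
# fetches; iterating the cursor still yields one dict per document.
cursor = collection.find().batch_size(1000)

for document in cursor:
    print(document["_id"])

cursor.rewind()            # reposition the cursor to revisit the documents
all_docs = list(cursor)    # pulls everything into one list (watch your RAM)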
  • OK, so if I want to create an iterator that returns 1000 docs at a time from the DB in my local RAM how do I do that? – aaron Sep 27 '16 at 18:11
  • collection.find({...}, limit=1000) – nickmilon Sep 28 '16 at 22:48
  • @nickmilon I suppose `limit=1000` will only return 1000 documents from the db (and then cursor will be exhausted). How can I iterate over the whole collection by the batches of 1000, so that `records.next()` will return lists with len = 1000? – mkurnikov Feb 07 '17 at 11:18
  • There is no MongoDB option for that, but you can handle it in your code: a) use limit and repeat the find with a kind of paging mechanism in the query, or b) l = list(db.collection.find({...})); sl = [l[x:x+1000] for x in range(0, len(l), 1000)]; for i in sl: print(i) (see the sketch below). But make sure your memory can handle those big lists – nickmilon Feb 09 '17 at 00:33
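
To get the "list of 1000 documents per iteration" behaviour discussed in these comments without materializing the whole collection first, a hedged sketch along these lines should work (the chunked helper is an illustrative name, not part of pymongo; MONGO_CONN and the database/collection names are the question's placeholders):

from itertools import islice

from pymongo import MongoClient

def chunked(cursor, size=1000):
    """Yield successive lists of up to `size` documents from a pymongo cursor."""
    while True:
        chunk = list(islice(cursor, size))
        if not chunk:
            break
        yield chunk

client = MongoClient(MONGO_CONN)                      # MONGO_CONN as in the question
cursor = client.database_name.collection_name.find().batch_size(1000)

for batch in chunked(cursor):
    # batch is a list of up to 1000 dicts, e.g. [{'_id': ID1, ...}, ...]
    print(len(batch))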