
I'm having a much more difficult time than I expected importing multiple documents from MongoDB into RAM in batches. I am writing an application that communicates with a MongoDB instance via pymongo; the database currently holds about 2 GB of data but could grow to over 1 TB in the near future. Because of this, reading a limited number of records into RAM at a time is important for scalability.

Based on this post and this documentation I thought this would be about as easy as:

from pymongo import MongoClient

HOST = MongoClient(MONGO_CONN)
DB_CONN = HOST.database_name
collection = DB_CONN.collection_name
cursor = collection.find()
cursor.batch_size(1000)
next_1K_records_in_RAM = cursor.next()  # expected this to return the next 1,000 documents

This isn't working for me, however. Even though my Mongo collection contains more than 200K BSON documents, iterating the cursor returns them one at a time as single dictionaries, e.g. {_id: ID1, ...}, instead of what I'm looking for, which is an array of dictionaries representing multiple documents from my collection, e.g. [{_id: ID1, ...}, {_id: ID2, ...}, ..., {_id: ID1000, ...}].

I wouldn't expect this to matter, but I'm on Python 3.5 instead of 2.7.

As this example references a secure, remote data source, it isn't reproducible. Apologies for that. If you have a suggestion for how the question can be improved, please let me know.


1 Answer

  • The Python version is irrelevant here; it has nothing to do with your output.
  • batch_size only defines how many documents MongoDB returns in a single round trip to the server (subject to some limitations: see here).
  • collection.find() always returns a cursor (an iterator), even when no documents match; batching does its job transparently behind it.
  • To examine the returned documents you have to iterate through the cursor (see the sketch after this list), e.g.

    for document in cursor:
        print(document)

    or, if you want a list of the documents: list(cursor)

    • Remember to do a cursor.rewind() if you need to revisit the documents
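
As a minimal sketch of the points above (reusing the MONGO_CONN, database_name, and collection_name placeholders from the question; an illustration, not a drop-in implementation):

from pymongo import MongoClient

client = MongoClient(MONGO_CONN)                      # MONGO_CONN as in the question
collection = client.database_name.collection_name

# batch_size only tunes how many documents each round trip to the server
# fetches; iterating the cursor still yields one dict per document.
cursor = collection.find().batch_size(1000)

for document in cursor:
    print(document["_id"])

cursor.rewind()            # reposition the cursor to revisit the documents
all_docs = list(cursor)    # pulls everything into one list (watch your RAM)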
  • OK, so if I want to create an iterator that returns 1000 docs at a time from the DB in my local RAM how do I do that? – aaron Sep 27 '16 at 18:11
  • collection.find({...}, limit=1000) – nickmilon Sep 28 '16 at 22:48
  • @nickmilon I suppose `limit=1000` will only return 1000 documents from the db (and then cursor will be exhausted). How can I iterate over the whole collection by the batches of 1000, so that `records.next()` will return lists with len = 1000? – mkurnikov Feb 07 '17 at 11:18
  • There is no MongoDB option for that, but you can handle it in your code: a) use limit and repeat the find with a kind of paging mechanism in the query, or b) l = list(db.collection.find({...})); sl = [l[x:x+1000] for x in range(0, len(l), 1000)]; for i in sl: print(i) (see the sketch below). But make sure your memory can handle those big lists – nickmilon Feb 09 '17 at 00:33
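
To get the "list of 1000 documents per iteration" behaviour discussed in these comments without materializing the whole collection first, a hedged sketch along these lines should work (the chunked helper is an illustrative name, not part of pymongo; MONGO_CONN and the database/collection names are the question's placeholders):

from itertools import islice

from pymongo import MongoClient

def chunked(cursor, size=1000):
    """Yield successive lists of up to `size` documents from a pymongo cursor."""
    while True:
        chunk = list(islice(cursor, size))
        if not chunk:
            break
        yield chunk

client = MongoClient(MONGO_CONN)                      # MONGO_CONN as in the question
cursor = client.database_name.collection_name.find().batch_size(1000)

for batch in chunked(cursor):
    # batch is a list of up to 1000 dicts, e.g. [{'_id': ID1, ...}, ...]
    print(len(batch))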