
I have 75 million records in MongoDB. I need to read all of the data in batches (of, say, 100,000) and store it in some sort of stream/queue. Once the stream has data, a Python script will read from it and operate on the data.

Basically I want the results of collection.find() to be served in batches of 100,000.

I know I can do collection.find()[0:100000], then collection.find()[100000:200000], and so on, as shown here.
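
For reference, a minimal sketch of that skip/limit approach in PyMongo (the connection string, database, and collection names are hypothetical, and `process()` is just a placeholder for pushing a batch onto the stream/queue):

```python
from pymongo import MongoClient

# Hypothetical connection / names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

BATCH = 100_000


def process(batch):
    # Placeholder: hand the batch off to the stream / queue here.
    print(f"got {len(batch)} documents")


total = collection.count_documents({})
for start in range(0, total, BATCH):
    # Equivalent to collection.find()[start:start + BATCH]:
    # the server skips `start` documents on every query.
    batch = list(collection.find().skip(start).limit(BATCH))
    process(batch)
```

Each pass makes the server walk past everything already read before returning the next slice, which is exactly the cost in question.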

But I'm worried about the efficiency of running 'skip' every time.

I'm aware of cursor.batchSize(), but I'm not sure how to use it to keep reading the data sequentially.
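
If it helps, here is a rough sketch of how batch_size (PyMongo's spelling of batchSize) fits in: it only controls how many documents each getMore round trip returns to the driver, so the 100,000-document application batches still have to be assembled client-side while iterating a single cursor. Names are hypothetical, as above.

```python
from itertools import islice

from pymongo import MongoClient

# Hypothetical connection / names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

BATCH = 100_000


def process(batch):
    # Placeholder: hand the batch off to the stream / queue here.
    print(f"got {len(batch)} documents")


# One cursor over the whole collection; the driver fetches
# up to 10,000 documents per getMore behind the scenes.
cursor = collection.find(batch_size=10_000)
while True:
    chunk = list(islice(cursor, BATCH))
    if not chunk:
        break
    process(chunk)
```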

Is there any better way to do this?

asked by Mallika Khullar
  • Yes. Use `$gt` from the "Last Seen" `_id` value on the collection, presuming you are basically reading in that order. Otherwise it would be a similar "range" selection by whatever the sort order is, "plus" a `$nin` for excluding some of the "last seen ids" from the previous page if the "sorted" values are not unique. – Neil Lunn Jul 20 '17 at 12:16 (a sketch of this appears below)
  • I encountered a similar problem and found this solution; maybe it can help: https://gist.github.com/ccca495816e5d5102e50a0429664ce34.git – Ceyhun Oct 09 '19 at 08:29
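
A minimal sketch of the `$gt`-on-last-seen-`_id` approach from the first comment, assuming documents are read in ascending `_id` order so no `$nin` exclusion is needed (connection and collection names are hypothetical):

```python
from pymongo import MongoClient

# Hypothetical connection / names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

BATCH = 100_000


def process(batch):
    # Placeholder: hand the batch off to the stream / queue here.
    print(f"got {len(batch)} documents")


last_id = None
while True:
    query = {} if last_id is None else {"_id": {"$gt": last_id}}
    # Each query seeks straight into the _id index at the last seen
    # value instead of skipping over everything already read.
    batch = list(collection.find(query).sort("_id", 1).limit(BATCH))
    if not batch:
        break
    process(batch)
    last_id = batch[-1]["_id"]
```

Unlike skip, the cost per batch stays roughly constant no matter how far into the 75 million records the read has progressed.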

0 Answers