
I have a Mongo collection with a few million records that I need to iterate over, preprocessing each document and storing the result in a separate collection. Would it be better to loop over a cursor limited to small chunks, or to iterate over a single cursor with no limit defined?

from pymongo import MongoClient
from other_file import process

mc = MongoClient()
collection_obj = mc.mydb.mycoll
# method 1: a single unbounded cursor over the whole collection
cursor1 = collection_obj.find({})
for each_ele in cursor1:
    process(each_ele)

# method 2: page through the collection in chunks of 5000 with limit/skip
total_length = collection_obj.count_documents({})  # total documents to iterate
for i in range(0, total_length, 5000):
    cursor2 = collection_obj.find({}).limit(5000).skip(i)
    for each in cursor2:
        process(each)

Which of these would be better for large data sets?

Kenstars
  • Neither. Use `$gte` with the last seen primary key, or optionally a sort field and a list of the last primary key values for that last sort value. It's much faster than "skipping" through the results when all you want is "batches". – Neil Lunn Mar 29 '19 at 09:30
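For reference, a minimal sketch of the range-based batching the comment describes, assuming the default `ObjectId` `_id` and reusing the collection and `process` names from the question (the simple single-key case resumes with `$gt` on the last `_id` seen):

from pymongo import MongoClient
from other_file import process

mc = MongoClient()
collection_obj = mc.mydb.mycoll

batch_size = 5000
last_id = None  # _id of the last document seen in the previous batch
while True:
    # resume strictly after the last seen _id; the first pass scans from the start
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    batch = list(collection_obj.find(query).sort("_id", 1).limit(batch_size))
    if not batch:
        break
    for doc in batch:
        process(doc)
    last_id = batch[-1]["_id"]  # remember where this batch ended

Because `_id` is indexed, each batch starts with an index seek at the resume point, whereas `skip(i)` must walk and discard `i` documents on every query, so the cost of method 2 grows with each successive page.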
