
I have a collection with 500K+ documents stored on a single-node MongoDB instance. Every now and then my PyMongo `find()` cursor fails because it times out.

While I could set the find to ignore the timeout, I do not like that approach. Instead, I tried a generator (adapted from this answer and this link):

def mongo_iterator(self, cursor, limit=1000):
    skip = 0
    while True:
        # 'cursor' is actually the pymongo Collection being queried
        results = cursor.find({}).sort("signature", 1).skip(skip).limit(limit)

        try:
            results.next()  # peek to detect an empty batch

        except StopIteration:
            break

        for result in results:
            yield result

        skip += limit

I then call this method using:

ref_results_iter = self.mongo_iterator(cursor=latest_rents_refs, limit=50000)
for ref in ref_results_iter:
    results_latest1.append(ref)

The problem: my iterator does not return the same number of documents that the collection holds. The issue is that `next()` advances the cursor, so on every batch I lose one element...

The question: is there a way to adapt this code so that I can check whether a next element exists? PyMongo 3.x does not provide `hasNext()`, and the `alive` check is not guaranteed to return `False`.

zevij
  • `0 to 1000` corresponds to `[0, 1, 2, 3, ..., 999]`; the next start is `1000`, but you lose one (probably the last one). So the index number never equals the length number. – dsgdfg Sep 21 '16 at 06:00
  • Would it work to say `first_result_in_batch = results.next()`, thus capturing the element you are presently discarding (if any)? Then you would put `yield first_result_in_batch` above the for-loop, thus giving that element to the caller in the correct order. (I don't know MongoDB, so maybe I am missing something.) – D-Von Sep 24 '16 at 15:45
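
A minimal sketch of D-Von's suggestion, written as a standalone function (the collection argument and sort key are taken from the question):

def mongo_iterator(collection, limit=1000):
    skip = 0
    while True:
        results = collection.find({}).sort("signature", 1).skip(skip).limit(limit)
        try:
            # Peek by consuming the first document of the batch...
            first_result_in_batch = next(results)
        except StopIteration:
            break  # empty batch: nothing left to fetch
        # ...then hand it to the caller so it is no longer lost.
        yield first_result_in_batch
        for result in results:
            yield result
        skip += limit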

2 Answers


The `.find()` method takes additional keyword arguments. One of them is `no_cursor_timeout`, which you need to set to `True`:

cursor = collection.find({}, no_cursor_timeout=True)

You don't need to write your own generator function. The `find()` method returns a generator-like `Cursor` object that you can iterate directly.
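
For completeness, a sketch of how that might look end to end; the connection details and the database/collection names are placeholders. Note that the server keeps a `no_cursor_timeout` cursor open until it is exhausted or closed, so close it explicitly:

from pymongo import MongoClient

client = MongoClient()                       # placeholder connection
collection = client["mydb"]["mycollection"]  # placeholder names

cursor = collection.find({}, no_cursor_timeout=True)
try:
    for doc in cursor:
        print(doc["_id"])  # placeholder for the per-document work
finally:
    cursor.close()  # release the server-side cursor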

styvane

Why not use

for result in results:
    yield result

The for loop should handle StopIteration for you.
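
Applied to the generator from the question, that might look like the following sketch: keep the skip/limit batching, but count the documents each batch yields instead of peeking with `next()`:

def mongo_iterator(collection, limit=1000):
    skip = 0
    while True:
        results = collection.find({}).sort("signature", 1).skip(skip).limit(limit)
        count = 0
        for result in results:  # the for loop deals with StopIteration itself
            yield result
            count += 1
        if count == 0:  # an empty batch means we are past the end
            break
        skip += limit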

Patrick Haugh
  • It does stop, but then I need to track and handle the iterations and skips outside (e.g. fetch the first 10,000, process, fetch the next 10,000, process, etc.). As I said, the problem is stopping without losing data. – zevij Sep 21 '16 at 02:48
  • @dsgdfg you're missing the whole point. I agree you already have 1000, but due to `next()` you just dropped one. I guess the only way is to perform a count and take the skip/limit logic out of the iterator. – zevij Sep 21 '16 at 11:10
  • @Patrick Haugh - while I agree with your statement, what you are missing is that the correct iteration behaviour, one that guarantees you neither time out nor 'hang', requires skip and limit. The problem is that `cursor.alive` is not guaranteed to return `False`, `hasNext()` does not exist, and if you use `next()` you drop an element... As stated above, the only way I can see this working is to take the skip/limit logic out of the iterator (see the sketch after these comments)... – zevij Sep 21 '16 at 11:13
  • A Mongo user never holds 1000 elements in a cursor/RAM. Set every value you need at write time, i.e. work only with the last entry. Who told you `you can't save results to MongoDB`? That wastes resources and time. But maybe you want to play with `real_time` data records, which requires making every `_id` a time-based id. The `find` method is garbage on MongoDB; you are trying to use a `Non Table DB` as a classic DB. Use another DB system. A Mongo user never uses the `find` function because **`db['I am']['know']['where']['writing']['this']['data']`** always returns `data` or `None`. – dsgdfg Sep 22 '16 at 07:57
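
For reference, a sketch of the approach zevij describes in the comments above (count first, keep the skip/limit bookkeeping outside the document iterator); `collection.count()` is the PyMongo 3.x spelling, replaced by `count_documents({})` in later versions:

def mongo_batches(collection, limit=1000):
    total = collection.count()  # count_documents({}) in PyMongo >= 3.7
    for skip in range(0, total, limit):
        # Each batch is a fresh, short-lived cursor, so it cannot idle out.
        yield collection.find({}).sort("signature", 1).skip(skip).limit(limit)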