
I have a collection with 500K+ documents stored on a single-node MongoDB instance. Every now and then my PyMongo `find()` cursor fails because it times out.

While I could set the find to ignore the timeout, I do not like that approach. Instead, I tried a generator (adapted from this answer and this link):

def mongo_iterator(self, cursor, limit=1000):
    skip = 0
    while True:
        # 'cursor' is actually the pymongo Collection being queried
        results = cursor.find({}).sort("signature", 1).skip(skip).limit(limit)

        try:
            results.next()  # peek to detect an empty batch

        except StopIteration:
            break

        for result in results:
            yield result

        skip += limit

I then call this method using:

ref_results_iter = self.mongo_iterator(cursor=latest_rents_refs, limit=50000)
for ref in ref_results_iter:
    results_latest1.append(ref)

The problem: my iterator does not return the same number of documents that the collection holds. The issue is that `next()` advances the cursor, so on every batch I lose one element...

The question: is there a way to adapt this code so that I can check whether a next element exists? PyMongo 3.x does not provide `hasNext()`, and the `alive` check is not guaranteed to return `False`.

zevij
  • `0 to 1000` corresponds to `[0, 1, 2, 3, ..., 999]`; the next start is `1000`, but you lose one (probably the last one). So the index number never equals the length number. – dsgdfg Sep 21 '16 at 06:00
  • Would it work to say `first_result_in_batch = results.next()`, thus capturing the element you are presently discarding (if any)? Then you would put `yield first_result_in_batch` above the for-loop, thus giving that element to the caller in the correct order. (I don't know MongoDB, so maybe I am missing something.) – D-Von Sep 24 '16 at 15:45
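
A minimal sketch of D-Von's suggestion, written as a standalone function (the collection argument and sort key are taken from the question):

def mongo_iterator(collection, limit=1000):
    skip = 0
    while True:
        results = collection.find({}).sort("signature", 1).skip(skip).limit(limit)
        try:
            # Peek by consuming the first document of the batch...
            first_result_in_batch = next(results)
        except StopIteration:
            break  # empty batch: nothing left to fetch
        # ...then hand it to the caller so it is no longer lost.
        yield first_result_in_batch
        for result in results:
            yield result
        skip += limit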

2 Answers


The `.find()` method takes additional keyword arguments. One of them is `no_cursor_timeout`, which you need to set to `True`:

cursor = collection.find({}, no_cursor_timeout=True)

You don't need to write your own generator function. The `find()` method returns a generator-like `Cursor` object that you can iterate directly.
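
For completeness, a sketch of how that might look end to end; the connection details and the database/collection names are placeholders. Note that the server keeps a `no_cursor_timeout` cursor open until it is exhausted or closed, so close it explicitly:

from pymongo import MongoClient

client = MongoClient()                       # placeholder connection
collection = client["mydb"]["mycollection"]  # placeholder names

cursor = collection.find({}, no_cursor_timeout=True)
try:
    for doc in cursor:
        print(doc["_id"])  # placeholder for the per-document work
finally:
    cursor.close()  # release the server-side cursor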

styvane

Why not use

for result in results:
    yield result

The for loop should handle StopIteration for you.
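
Applied to the generator from the question, that might look like the following sketch: keep the skip/limit batching, but count the documents each batch yields instead of peeking with `next()`:

def mongo_iterator(collection, limit=1000):
    skip = 0
    while True:
        results = collection.find({}).sort("signature", 1).skip(skip).limit(limit)
        count = 0
        for result in results:  # the for loop deals with StopIteration itself
            yield result
            count += 1
        if count == 0:  # an empty batch means we are past the end
            break
        skip += limit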

Patrick Haugh
  • It does stop, but then I need to track and handle the iterations and skips outside (e.g. fetch the first 10,000, process, fetch the next 10,000, process, etc.). As I said, the problem is stopping without losing data. – zevij Sep 21 '16 at 02:48
  • @dsgdfg you're missing the whole point. I agree you already have 1000, but due to `next()` you just dropped one. I guess the only way is to perform a count and take the skip/limit logic out of the iterator. – zevij Sep 21 '16 at 11:10
  • @Patrick Haugh - while I agree with your statement, what you are missing is that the correct iteration behaviour, one that guarantees you neither time out nor 'hang', requires skip and limit. The problem is that `cursor.alive` is not guaranteed to return `False`, `hasNext()` does not exist, and if you use `next()` you drop an element... As stated above, the only way I can see this working is to take the skip/limit logic out of the iterator (see the sketch after these comments)... – zevij Sep 21 '16 at 11:13
  • A Mongo user never holds 1000 elements in a cursor/RAM. Set every value you need at write time, i.e. work only with the last entry. Who told you `you can't save results to MongoDB`? That wastes resources and time. But maybe you want to play with `real_time` data records, which requires making every `_id` a time-based id. The `find` method is garbage on MongoDB; you are trying to use a `Non Table DB` as a classic DB. Use another DB system. A Mongo user never uses the `find` function because **`db['I am']['know']['where']['writing']['this']['data']`** always returns `data` or `None`. – dsgdfg Sep 22 '16 at 07:57
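
For reference, a sketch of the approach zevij describes in the comments above (count first, keep the skip/limit bookkeeping outside the document iterator); `collection.count()` is the PyMongo 3.x spelling, replaced by `count_documents({})` in later versions:

def mongo_batches(collection, limit=1000):
    total = collection.count()  # count_documents({}) in PyMongo >= 3.7
    for skip in range(0, total, limit):
        # Each batch is a fresh, short-lived cursor, so it cannot idle out.
        yield collection.find({}).sort("signature", 1).skip(skip).limit(limit)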