24

I've recently started testing MongoDB via the shell and via PyMongo. I've noticed that returning a cursor and trying to iterate over it seems to bottleneck in the iteration itself. Is there a way to return more than one document per iteration?

Pseudo code:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for entry in cursor:
        ...  # deal with a single entry each time

What I'm hoping to do is something like this:

for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for all_entries in cursor:
        ...  # deal with all entries at once rather than iterating one at a time

I've tried using batch_size() as per this question and changing the value all the way up to 1000000, but it doesn't seem to have any effect (or I'm doing it wrong).
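
Roughly what I tried (a sketch of my batch_size() attempt, using the names from the pseudo code above):

cursor = collection.find({"field": value}).batch_size(1000000)  # batch_size only changes how many docs come back per round trip
for entry in cursor:
    ...  # still handed one document per iteration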

Any help is greatly appreciated. Please be easy on this Mongo newbie!

--- EDIT ---

Thank you Caleb. I think you've pointed out what I was really trying to ask, which is this: is there any way to do a sort-of collection.findAll() or maybe cursor.fetchAll() command, as there is with the cx_Oracle module? The problem isn't storing the data, but retrieving it from the Mongo DB as fast as possible.

As far as I can tell, the speed at which the data is returned to me is dictated by my network since Mongo has to single-fetch each record, correct?
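
For reference, this is the cx_Oracle pattern I have in mind (just a sketch; the connection string, table, and bind names are made up):

import cx_Oracle

conn = cx_Oracle.connect("user/password@dsn")  # made-up connection details
oracle_cur = conn.cursor()
oracle_cur.execute("SELECT * FROM some_table WHERE field = :v", v=value)
rows = oracle_cur.fetchall()  # one call, and every matching row comes back at once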

Valdogg21
  • You can only return one record per iteration. Using the `batch_size` method tells the cursor internally how many records to fetch at once. So if the iteration (and not the fetching) is the bottleneck, you could try a list comprehension. I want to say there is an internal memory limit of 4MB in the cursor for the fetched records. – Uyghur Lives Matter Jul 13 '11 at 14:46
  • I have the exact same problem. I am new to Mongo (and Python, for that matter). I think all the suggestions are essentially equivalent, in the sense that those various Python functions still interface with Mongo in exactly the same way, causing exactly the same outcome each time. In other words, Mongo cannot tell the difference between any of these approaches; as far as it's concerned, it did the find() request and then had the cursor requested "n" times. – Landon Oct 01 '12 at 18:44
  • @Valdogg21, this was over a year ago; how did you eventually solve this? – Landon Oct 01 '12 at 18:44
  • Shortly after asking the question, we abandoned using MongoDB at all. I believe you're right in saying that Mongo essentially does fetch() N times, regardless of some settings, but I'm not exactly sure. Sorry I couldn't be of more help. – Valdogg21 Oct 08 '12 at 20:11

4 Answers

17

Have you considered an approach like:

for line in file:
  value = line[a:b]
  cursor = collection.find({"field": value})
  entries = list(cursor)  # or pull them out with a loop or comprehension -- just get all the docs into memory
  # then process entries as a list, either singly or in batch

Alternately, something like:

entries = {}
for line in file:
  value = line[a:b]
  cursor = collection.find({"field": value})
  entries[value] = list(cursor)

# after the loop, all the cursors are out of scope and closed
for value in entries:
  ...  # process entries[value], either singly or in batch

Basically, as long as you have enough RAM to store your result sets, you should be able to pull them off the cursors and hold onto them before processing. This isn't likely to be significantly faster, but it will mitigate any slowdown specific to the cursors, and it frees you up to process your data in parallel if you're set up for that.

jmelesky
  • Thank you! I'll try both suggestions, `entries = list(cursor)` and `entries = [entry for entry in cursor]`, against my original `for entry in cursor` method to test for performance. As I mentioned above in my edit, though, I believe the real problem is elsewhere. – Valdogg21 Jul 13 '11 at 18:01
15

You could also try:

results = list(collection.find({'field':value}))

That should load everything right into RAM.

Or this perhaps, if your file is not too huge:

values = list()
for line in file:
    values.append(line[a:b])
results = list(collection.find({'field': {'$in': values}}))
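
If the per-line grouping still matters after the combined `$in` query, the documents can be regrouped client-side (a sketch; it assumes the queried field is present in each returned document):

from collections import defaultdict

grouped = defaultdict(list)
for doc in results:
    grouped[doc['field']].append(doc)  # bucket each document under its field value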
Isaac C.
2

toArray() might be a solution. Based on the docs, it first iterates the cursor fully on the Mongo side and only returns the results once, in the form of an array.

http://docs.mongodb.org/manual/reference/method/cursor.toArray/

This is unlike list(coll.find()) or [doc for doc in coll.find()], which fetch one document into Python at a time and go back to Mongo to fetch the next one.

However, this method is not implemented in PyMongo... strange
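
The closest thing I know of in PyMongo is to materialize the cursor yourself and use batch_size() to control how many documents come back per round trip (a sketch; the batch size value is arbitrary):

docs = list(collection.find({"field": value}).batch_size(10000))  # exhausts the cursor into a list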

MK Yung
-1

As mentioned above by @jmelesky, I always follow the same kind of method. Here is my sample code. To store my cursor twts_result, I declare the list below to copy it into. Make use of RAM if you can to store the data. This also solves the cursor-timeout problem, as long as no processing or updating is needed on the collection you fetched the data from.

Here I am fetching tweets from the collection.

twts_result = maindb.economy_geolocation.find({}, {'_id': False})
print("Tweets for processing -> %d" % twts_result.count())

tweets_sentiment = []
batch_tweets = []
# Copy the cursor data into a list
tweets_collection = list(twts_result)
for twt in tweets_collection:
    ...  # do stuff here with the **twt** data
daemonsl