
I have a collection with about 500,000 datasets (documents) in it, and I would like to fetch a random one. I can restrict the find() to a customer ID, which reduces the candidates to about 80,000 datasets. The customer ID field is indexed.

In PHP I use the following command to get the random dataset:

 $mongoCursor = $mongoCollection->find($arrQuery, $arrFields)->skip(rand(1, $dataCount));

The profiler reports:

 DB.Collection ntoskip:3224 nscanned:3326 nreturned:101 reslen:77979 262ms

This takes quite some time to fetch the result. Is there a better way to get the data?

I thought about fetching all IDs in PHP, then picking one at random and fetching the complete dataset for that ID. But I worry about pulling that much data into PHP.

Thanks for any thoughts on this topic. Dan

thesonix
  • There *could* be a better way once there is enough demand for it... There is a [feature request to get random items from a collection](https://jira.mongodb.org/browse/SERVER-533) in the MongoDB ticket tracker. If implemented natively, it would likely be the most efficient option. (If you want the feature, go vote it up.) – David J. Jun 17 '12 at 02:34
  • This question has been asked in many forms here on Stack Overflow. The most popular question is [Random record from MongoDB](http://stackoverflow.com/questions/2824157/random-record-from-mongodb) -- it has good responses. That said, I think the best way of thinking about the question is not to think about getting one random document but, rather, randomizing a result set. See [Ordering a result set randomly in Mongo](http://stackoverflow.com/questions/8500266/ordering-a-result-set-randomly-in-mongo) for that. – David J. Jun 17 '12 at 02:42

2 Answers

Skip forces Mongo to walk through the result set until it gets to the document you're looking for, so the bigger the result set of that query, the longer it's going to take.
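As a side note on the query in the question: PHP's rand(1, $dataCount) is inclusive at both ends, so it can skip every matching document, and without a limit the first batch carries many documents (nreturned:101 in the profile). Below is a minimal sketch of a bounded offset, assuming the match count is already known. `randomSkipOffset` is a hypothetical helper, not a driver function, and it does not remove the cost of the skip itself.

```javascript
// Hypothetical helper: returns an offset in 0 .. count-1 so that
// skip(offset) always leaves at least one document to return.
function randomSkipOffset(count, rng = Math.random) {
  if (!Number.isInteger(count) || count <= 0) {
    throw new RangeError("count must be a positive integer");
  }
  return Math.floor(rng() * count);
}

// Assumed shell usage (query stands for the customer-id filter):
//   db.collection.find(query).skip(offset).limit(1)
const exampleOffset = randomSkipOffset(80000);
console.log(exampleOffset);
```

The limit(1) only trims what is returned; Mongo still walks past the skipped documents.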

What you really need for this use case is a way to randomly identify a document, not randomly query one. You could give each document an incremental identifier, then just randomly pick a number in that known range of ids until you find one that exists, but if you delete a lot of documents or need to apply a query that filters the possible matches, that range will be sparsely populated and it could end up taking even longer to find a result. It depends on your data and usage.
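A rough sketch of the incremental-identifier idea, using an in-memory Map in place of the collection. The field name `seq`, the helper `pickByRandomId`, and the attempt cap are all assumptions for illustration; in practice `lookupById` would wrap an indexed lookup on the identifier.

```javascript
// Hypothetical helper: draw random ids in 1 .. maxId until one exists.
// lookupById stands in for an indexed lookup such as findOne({seq: id}).
function pickByRandomId(lookupById, maxId, maxAttempts = 50, rng = Math.random) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const id = 1 + Math.floor(rng() * maxId); // uniform in 1 .. maxId
    const doc = lookupById(id);
    if (doc !== undefined && doc !== null) {
      return doc;
    }
  }
  return null; // the id range is too sparse for this many attempts
}

// In-memory stand-in: ids 3 and 4 were "deleted".
const docsById = new Map([
  [1, { seq: 1 }],
  [2, { seq: 2 }],
  [5, { seq: 5 }],
]);
const pickedById = pickByRandomId((id) => docsById.get(id), 5);
console.log(pickedById);
```

The attempt cap is what keeps a sparse range from turning into an unbounded loop; the more gaps there are, the more lookups each pick costs.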

If this method won't work for your data and usage, you could also try the method discussed here: http://cookbook.mongodb.org/patterns/random-attribute/
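For orientation, here is a sketch of how a random-attribute lookup can work, simulated on a plain array. It assumes each document stores a precomputed value in a field called `random` (a hypothetical name, as are `customerId` and `pickByRandomAttribute`), and it is my reading of the pattern rather than the cookbook's exact code.

```javascript
// Hypothetical simulation of the lookup. Assumed shell equivalents:
//   db.collection.find({customerId: id, random: {$gte: r}}).sort({random: 1}).limit(1)
//   and, if that finds nothing,
//   db.collection.find({customerId: id, random: {$lt: r}}).sort({random: -1}).limit(1)
function pickByRandomAttribute(docs, customerId, r) {
  const candidates = docs
    .filter((d) => d.customerId === customerId)
    .sort((a, b) => a.random - b.random); // filter() copies, so docs is untouched
  if (candidates.length === 0) {
    return null;
  }
  // First document at or above r; otherwise fall back to the largest one below r.
  const atOrAbove = candidates.find((d) => d.random >= r);
  return atOrAbove !== undefined ? atOrAbove : candidates[candidates.length - 1];
}

const exampleDocs = [
  { _id: "a", customerId: 7, random: 0.1 },
  { _id: "b", customerId: 7, random: 0.5 },
  { _id: "c", customerId: 7, random: 0.9 },
  { _id: "d", customerId: 8, random: 0.3 },
];
console.log(pickByRandomAttribute(exampleDocs, 7, Math.random()));
```

With an index covering the filter field and `random`, each pick becomes a short index lookup instead of a skip. The gaps between stored values are uneven, so the pick is only approximately uniform.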

The bottom line is that Mongo won't do this for you, so it's really up to you to figure out how to randomly identify a document in your data.

Tim Gautier

Hi, I tried multiple solutions to the random-selection problem. First I used a cursor and moved it to the random position, but this was extremely slow. Then I loaded the full dataset and picked random items from it, which was okay but could be better.

The best-performing solution for me was to pick a set of random numbers, take their min and max values, and query the database using:

 db.collection.find({...}).skip(min).limit(max - min + 1);

Then I iterated once through the result, keeping a counter that starts at i = min and increases by one per document, and took only the items whose index matched a number in the random set. For me it was also acceptable to restrict the min-max window to a randomly chosen area. I used a logarithmic approach to choose the size of the window according to my collection size.

The result is a really fast way to pick random result sets.
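The windowed approach can be sketched roughly like this, with an array slice standing in for the skip/limit query. `planWindow`, `selectFromWindow`, and the fixed window size are hypothetical; the window is taken as inclusive of both ends (a limit of max - min + 1) so that the document at index max is not dropped, which is an assumption on top of the description above.

```javascript
// Hypothetical helper: choose a random window, then random indices inside it.
function planWindow(count, picks, windowSize, rng = Math.random) {
  if (count <= 0 || picks <= 0 || windowSize <= 0) {
    throw new RangeError("count, picks and windowSize must be positive");
  }
  const size = Math.min(windowSize, count);
  const start = Math.floor(rng() * (count - size + 1));
  const target = Math.min(picks, size);
  const wanted = new Set();
  while (wanted.size < target) {
    wanted.add(start + Math.floor(rng() * size)); // duplicates are simply redrawn
  }
  const indices = [...wanted].sort((a, b) => a - b);
  const min = indices[0];
  const max = indices[indices.length - 1];
  return { skip: min, limit: max - min + 1, wanted };
}

// Single pass over the window, counting from i = min as described above.
function selectFromWindow(windowDocs, plan) {
  const selected = [];
  let i = plan.skip;
  for (const doc of windowDocs) {
    if (plan.wanted.has(i)) {
      selected.push(doc);
    }
    i++;
  }
  return selected;
}

// In-memory stand-in for find({...}).skip(plan.skip).limit(plan.limit).
const allDocs = Array.from({ length: 1000 }, (_, n) => ({ n }));
const windowPlan = planWindow(allDocs.length, 3, 10);
const windowResult = allDocs.slice(windowPlan.skip, windowPlan.skip + windowPlan.limit);
console.log(selectFromWindow(windowResult, windowPlan));
```

Items picked this way are clustered inside one window, so they are not independent draws from the whole collection; that is the trade-off for issuing a single query.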

Hope this might help somebody too.

--- Dan

thesonix