443

I am looking to get a random record from a huge collection (100 million records).

What is the fastest and most efficient way to do so?

The data is already there, and there is no field on which I can generate a random number to obtain a random row.

Liam
Will M
  • See also this [SO question titled "Ordering a result set randomly in mongo"](http://stackoverflow.com/questions/8500266/ordering-a-result-set-randomly-in-mongo). Thinking about randomly ordering a result set is a more general version of this question -- more powerful and more useful. – David J. Jun 15 '12 at 20:30
  • This question keeps popping up. The latest information can likely be found at the [feature request to get random items from a collection](https://jira.mongodb.org/browse/SERVER-533) in the MongoDB ticket tracker. If implemented natively, it would likely be the most efficient option. (If you want the feature, go vote it up.) – David J. Jun 17 '12 at 02:37
  • Is this a sharded collection? – Dylan Tong Jul 27 '13 at 17:51
  • Does anyone know how much slower this is than just taking the first record? I'm debating whether it's worth taking a random sample to do something vs just doing it in order. – David Kong Feb 06 '20 at 15:00
  • Contrary to the answers here, $sample might not be the fastest solution, because MongoDB may do a collection scan for the random sort when using $sample, depending on the situation (see https://docs.mongodb.com/manual/reference/operator/aggregation/sample/). Counting the result set and doing a random skip/take may do better. – cahit beyaz Dec 05 '20 at 07:51

30 Answers

396

Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline operator:

// Get one random document from the mycoll collection.
db.mycoll.aggregate([{ $sample: { size: 1 } }])

If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:

// Get one random document matching {a: 10} from the mycoll collection.
db.mycoll.aggregate([
    { $match: { a: 10 } },
    { $sample: { size: 1 } }
])

As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
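If duplicates would matter for your use case, a minimal shell sketch (an addition, not part of the original answer) is to deduplicate by _id after sampling:

// Sketch: keep only the first occurrence of each _id in the sample.
// Mainly relevant on storage engines (e.g. MMAPv1) where $sample may repeat documents.
var seen = {};
var unique = [];
db.mycoll.aggregate([{ $sample: { size: 10 } }]).forEach(function (doc) {
    var key = doc._id.toString();
    if (!seen[key]) {
        seen[key] = true;
        unique.push(doc);
    }
});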

JohnnyHK
  • This is a good way, but remember that it does NOT guarantee that there are no copies of the same object in the sample. – Matheus Araujo Jan 06 '16 at 01:28
  • @MatheusAraujo which won't matter if you want one record but good point anyway – Toby Jan 10 '16 at 03:35
  • @dalanmiller it's only the *right* answer if you're using 3.2+, otherwise it is the wrong answer. – BanksySan Apr 07 '16 at 14:15
  • Not to be pedantic, but the question doesn't specify a MongoDB version, so I'd assume having the most recent version is reasonable. – dalanmiller Apr 07 '16 at 17:35
  • What's the computational complexity/cost of this? – Nepoxx May 17 '16 at 15:14
  • @Nepoxx See [the docs](https://docs.mongodb.com/manual/reference/operator/aggregation/sample/#behavior) regarding the processing involved. – JohnnyHK Jun 07 '16 at 13:32
  • There is some kind of issue, as per this question: http://stackoverflow.com/questions/37679999/mongodb-aggregation-with-sample-very-slow – Steve Rossiter Jun 07 '16 at 13:42
  • @Nepoxx I ran a profiler to get an idea: with 100,000 documents, the method took 0.155 sec to sample 10 documents. – Dorival Jun 08 '16 at 15:28
  • @MatheusAraujo it's pretty easy to filter the duplicates afterwards – Hai Phaikawl Sep 17 '17 at 14:08
  • .aggregate( [ { $sample: { size: 1 } } ] ) - this syntax worked for me in mongo 3.4 – Tebe Oct 04 '18 at 15:53
  • @Tebe Good point, the array pipeline is required now. Updated. – JohnnyHK Oct 04 '18 at 15:56
  • now the answer won't work for guys with mongo 3.2 and possibly lower – Tebe Oct 04 '18 at 15:57
  • @Tebe No, the array syntax was always supported, it just used to be optional. – JohnnyHK Oct 04 '18 at 15:58
  • A useful addition is to mention that to limit your search according to certain key-value pairs you can do: `db.mycoll.aggregate([{ $sample: { size: 1 } }, { $match: {key1: value1, key2: value2, ...}}])` – ThisIsNotAnId Feb 22 '19 at 18:58
  • @ThisIsNotAnId True, but you'd typically want to put the `$match` stage first. – JohnnyHK Feb 22 '19 at 19:06
  • @JohnnyHK Right, I realized that later. Total rookie mistake here. – ThisIsNotAnId Feb 22 '19 at 23:11
  • @HaiPhaikawl how can we remove duplicates from the results of $sample and $match stages ($match being the first stage)? – shivgre Apr 16 '19 at 09:33
  • It should sample before matching for optimal performance. See the article here: http://danielhnyk.cz/randomly-choose-n-records-from-mongodb/ – brycejl Apr 18 '20 at 22:59
  • @brycejl That would have the fatal flaw of not matching anything if the $sample stage didn't select any matching documents. – JohnnyHK Apr 19 '20 at 00:21
  • This one is too slow! With $sample: 0.2s; without: 0.012s. – John Tribe Jul 10 '20 at 06:37
  • I have written a package, https://www.npmjs.com/package/unique-random-docs, that fixes the duplication issue and returns only unique docs (npm i unique-random-docs) - feel free to check it out! – Detoner Mar 01 '21 at 10:13
  • For the MMAPv1 storage engine, $sample may return the same document more than once in the result set. For the WiredTiger or in-memory storage engine, $sample does not return duplicate documents. WiredTiger is the default storage engine as of MongoDB 3.2. – Mateen Kiani Dec 24 '21 at 16:55
123

Do a count of all records, generate a random number between 0 and the count (exclusive, so it stays a valid skip offset), and then do:

db.yourCollection.find().limit(-1).skip(yourRandomNumber).next()
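Assembled as a mongo shell sketch (yourCollection is a placeholder, and the offset must stay strictly below the count, as the comments below point out):

// Count the records, pick a random offset in [0, count), and skip to it.
var count = db.yourCollection.count();
var random = Math.floor(Math.random() * count); // 0 .. count-1, never count itself
var doc = db.yourCollection.find().limit(-1).skip(random).next();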
abraham
ceejayoz
  • Unfortunately skip() is rather inefficient since it has to scan that many documents. Also, there is a race condition if rows are removed between getting the count and running the query. – mstearn May 17 '10 at 18:49
  • Note that the random number should be between 0 and the count (exclusive). I.e., if you have 10 items, the random number should be between 0 and 9. Otherwise the cursor could try to skip past the last item, and nothing would be returned. – matt Apr 20 '11 at 22:05
  • Thanks, worked perfectly for my purposes. @mstearn, your comments on both efficiency and race conditions are valid, but for collections where neither matters (one-time server-side batch extract in a collection where records aren't deleted), this is vastly superior to the hacky (IMO) solution in the Mongo Cookbook. – Michael Moussa Sep 05 '12 at 16:27
  • What does setting the limit to -1 do? – MonkeyBonkey Jan 27 '13 at 12:46
  • @MonkeyBonkey http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/ "If numberToReturn is 0, the db will use the default return size. If the number is negative, then the database will return that number and close the cursor." – ceejayoz Jan 27 '13 at 15:24
  • This does seem like the most viable solution for equally distributed random selection. The other approach would be to generate a sequential ID for each of the documents - perhaps as another collection so it's not updating the original document, then using a random number in the number of documents present to select one. Of course, this could quickly get out of date, but would be useable for selecting multiple documents randomly by generating the list once and giving a matrix of numbers to get object IDs for. – David Burton Jul 11 '13 at 10:46
  • If you need greater efficiency than this then a modification of the cookbook recipe can yield evenly distributed random records. See my answer below. – spam_eggs Feb 09 '14 at 18:08
  • For anyone wondering how to count all records, see [the `count` command](http://docs.mongodb.org/manual/reference/command/count/). – Boaz Feb 10 '15 at 22:36
  • If your collection is equally distributed over time, you can use a timestamp to efficiently pick a random record. See http://stackoverflow.com/questions/2824157/random-record-from-mongodb/27306249#answer-27306249. – Martin Nowak Mar 31 '15 at 18:17
92

Update for MongoDB 3.2

3.2 introduced $sample to the aggregation pipeline.

There's also a good blog post on putting it into practice.

For older versions (previous answer)

This was actually a feature request: http://jira.mongodb.org/browse/SERVER-533 but it was filed under "Won't fix."

The cookbook has a very good recipe to select a random document out of a collection: http://cookbook.mongodb.org/patterns/random-attribute/

To paraphrase the recipe, you assign random numbers to your documents:

db.docs.save( { key : 1, ..., random : Math.random() } )

Then select a random document:

rand = Math.random()
result = db.docs.findOne( { key : 2, random : { $gte : rand } } )
if ( result == null ) {
  result = db.docs.findOne( { key : 2, random : { $lte : rand } } )
}

Querying with both $gte and $lte is necessary to find the document with a random number nearest rand.

And of course you'll want to index on the random field:

db.docs.ensureIndex( { key : 1, random : 1 } )

If you're already querying against an index, simply drop it, append random: 1 to it, and add it again.
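For example (a sketch, assuming the existing index is { key : 1 }):

// Replace the plain index with a compound one that includes the random field.
db.docs.dropIndex( { key : 1 } );
db.docs.ensureIndex( { key : 1, random : 1 } );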

Giacomo1968
Michael
  • And here is a simple way to add the random field to every document in the collection: function setRandom() { db.topics.find().forEach(function (obj) {obj.random = Math.random(); db.topics.save(obj);}); } db.eval(setRandom); – Geoffrey Jun 01 '11 at 01:18
  • The feature request has been reopened, but is not yet scheduled. – Leopd Oct 28 '11 at 18:48
  • This selects a document randomly, but if you do it more than once, the lookups are not independent. You are more likely to get the same document twice in a row than random chance would dictate. – lacker Jan 10 '12 at 02:19
  • Looks like a bad implementation of circular hashing. It's even worse than lacker says: even one lookup is biased because the random numbers aren't evenly distributed. To do this properly, you'd need a set of, say, 10 random numbers per document. The more random numbers you use per document, the more uniform the output distribution becomes. – Thomas Mar 29 '12 at 21:11
  • The MongoDB JIRA ticket is still alive: https://jira.mongodb.org/browse/SERVER-533 Go comment and vote if you want the feature. – David J. Jun 15 '12 at 20:32
  • Take note of the caveat mentioned: this does not work efficiently with a small number of documents. Given two items with random keys of 3 and 63, document #63 will be chosen more frequently where `$gte` is first. The alternative solution http://stackoverflow.com/a/9499484/79201 would work better in this case. – Ryan Schumacher Oct 30 '13 at 15:50
  • The bias can be eliminated by generating new random numbers as you go. I will post an answer describing this in more detail. – spam_eggs Feb 07 '14 at 17:19
  • If, for example, the first document of your collection has random=0.8, then random: { $gte: rand } will return this first document for all random values <= 0.8. In fact this is a terrible solution; I am wondering why it is so popular on the internet. – Anton Petrov Mar 27 '14 at 10:45
  • If you can't assure that you have MANY documents and an even distribution, this is a very bad solution, since it tends to give the same document often. – pomarc Apr 19 '14 at 13:08
  • Since map-reduce sorts input by key, one could use this behavior to get the closest result by simply selecting .first from the results (or .last in the case of $lte). – mmln May 05 '14 at 16:21
  • To make the distribution of results much more uniform, you can use `findAndModify()` and update the random field along with each query. – Julien Oct 05 '14 at 16:58
  • If you are willing to make two queries and an update, you can fix the randomness issue by selecting both records that are $gte and $lte, returning the record that is closest to the random value, and then updating that record to have a new random value. – diedthreetimes Nov 24 '14 at 01:10
  • Looks like the feature request was acknowledged and fixed very recently (2015/10, in version 3.1.6). You may want to update your answer. :) – grapeot Nov 11 '15 at 00:35
  • But when I do that my articles are gone – Alp Eren Gül May 31 '20 at 21:56
57

You can also use MongoDB's geospatial indexing feature to select the documents 'nearest' to a random number.

First, enable geospatial indexing on a collection:

db.docs.ensureIndex( { random_point: '2d' } )

To create a bunch of documents with random points on the X-axis:

for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), 0] } );
}

Then you can get a random document from the collection like this:

db.docs.findOne( { random_point : { $near : [Math.random(), 0] } } )

Or you can retrieve several documents nearest to a random point:

db.docs.find( { random_point : { $near : [Math.random(), 0] } } ).limit( 4 )

This requires only one query and no null checks, plus the code is clean, simple and flexible. You could even use the Y-axis of the geopoint to add a second randomness dimension to your query.
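As a sketch of that last idea (an extension by way of illustration, not part of the original answer), randomize both axes when inserting and querying:

// Random points in two dimensions instead of only along the X-axis.
for ( i = 0; i < 10; ++i ) {
    db.docs.insert( { key: i, random_point: [Math.random(), Math.random()] } );
}
db.docs.findOne( { random_point : { $near : [Math.random(), Math.random()] } } )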

Nico de Poel
  • I like this answer. It's the most efficient one I've seen that doesn't require a bunch of messing about server-side. – Tony Million Mar 10 '12 at 17:58
  • This is also biased towards documents that happen to have few points in their vicinity. – Thomas Mar 29 '12 at 21:13
  • That is true, and there are other problems as well: documents are strongly correlated on their random keys, so it's highly predictable which documents will be returned as a group if you select multiple documents. Also, documents close to the bounds (0 and 1) are less likely to be chosen. The latter could be solved by using spherical geomapping, which wraps around at the edges. However, you should see this answer as an improved version of the cookbook recipe, not as a perfect random selection mechanism. It's random enough for most purposes. – Nico de Poel Mar 30 '12 at 11:51
  • @NicodePoel, I like your answer as well as your comment! And I have a couple of questions for you: 1- How do you know that points close to the bounds 0 and 1 are less likely to be chosen; is that based on some mathematical ground? 2- Can you elaborate more on spherical geomapping, how it will improve the random selection, and how to do it in MongoDB? ... Appreciated! – securecurve Sep 10 '15 at 12:47
  • Appreciate your idea. Finally, I have code that is much more CPU- and RAM-friendly! Thank you – Qais Bsharat Mar 03 '20 at 22:49
21

The following recipe is a little slower than the mongo cookbook solution (add a random key on every document), but returns more evenly distributed random documents. It's a little less-evenly distributed than the skip( random ) solution, but much faster and more fail-safe in case documents are removed.

function draw(collection, query) {
    // query: mongodb query object (optional)
    var query = query || { };
    // Find the document with the highest random value at or below a fresh random number.
    query['random'] = { $lte: Math.random() };
    var cur = collection.find(query).sort({ random: -1 });
    if (! cur.hasNext()) {
        // Nothing at or below the random number: wrap around and take the highest overall.
        delete query.random;
        cur = collection.find(query).sort({ random: -1 });
    }
    var doc = cur.next();
    // Re-randomize the drawn document so repeated draws stay evenly distributed.
    doc.random = Math.random();
    collection.update({ _id: doc._id }, doc);
    return doc;
}

It also requires you to add a random "random" field to your documents, so don't forget to add it when you create them; you may need to initialize your collection as shown by Geoffrey:

function addRandom(collection) { 
    collection.find().forEach(function (obj) {
        obj.random = Math.random();
        collection.save(obj);
    }); 
} 
db.eval(addRandom, db.things);

Benchmark results

This method is much faster than the skip() method (of ceejayoz) and generates more uniformly random documents than the "cookbook" method reported by Michael:

For a collection with 1,000,000 elements:

  • This method takes less than a millisecond on my machine

  • the skip() method takes 180 ms on average

The cookbook method will cause large numbers of documents to never get picked because their random number does not favor them.

  • This method will pick all elements evenly over time.

  • In my benchmark it was only 30% slower than the cookbook method.

  • the randomness is not 100% perfect but it is very good (and it can be improved if necessary)

This recipe is not perfect - the perfect solution would be a built-in feature as others have noted.
However it should be a good compromise for many purposes.
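Usage might look like this (a sketch; db.things and the status filter are placeholder examples):

// Draw one random document, optionally restricted by a query.
var doc = draw(db.things);
var activeDoc = draw(db.things, { status: "A" });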

colllin
spam_eggs
12

Here is a way using the default ObjectId values for _id and a little math and logic.

// Get the "min" and "max" timestamp values from the _id in the collection and the 
// diff between.
// 4-bytes from a hex string is 8 characters

var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value within diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;

// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
   .sort({ "_id": 1 }).limit(1).toArray()[0];

That's the general logic in shell representation and easily adaptable.

So in points:

  • Find the min and max primary key values in the collection

  • Generate a random number that falls between the timestamps of those documents.

  • Add the random number to the minimum value and find the first document that is greater than or equal to that value.

This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value since that is what we are looking for. Using integers as the _id value is essentially simplier but the same basic idea in the points.

Blakes Seven
  • I have a collection of 300 000 000 lines. This is the only solution that works and it's fast enough. – Nikos Apr 14 '19 at 06:51
11

Now you can use the aggregate. Example:

db.users.aggregate(
   [ { $sample: { size: 3 } } ]
)

See the doc.

dbam
8

In Python using pymongo:

import random

def get_random_doc():
    count = collection.count()
    return collection.find()[random.randrange(count)]
Jabba
  • Worth noting that internally, this will use skip and limit, just like many of the other answers. – JohnnyHK Jan 24 '15 at 15:07
  • Your answer is correct. However, please replace `count()` with `estimated_document_count()`, as `count()` is deprecated in MongoDB v4.2. – user3848207 Jun 11 '20 at 23:50
8

Using Python (pymongo), the aggregate function also works.

collection.aggregate([{'$sample': {'size': sample_size }}])

This approach is a lot faster than fetching a document at a random index (e.g. collection.find()[random_int]), especially for large collections.

Daniel
7

It is tough if there is no data to key off of. What do the _id fields look like? Are they MongoDB ObjectIds? If so, you could get the highest and lowest values:

lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;

Then, if you assume the ids are uniformly distributed (they aren't, but at least it's a start), something like the following pseudocode:

unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)

V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();

randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
dm.
  • Any ideas how that would look in PHP? Or at least what language have you used above? Is it Python? – Marcin May 20 '13 at 18:03
5

You can pick a random timestamp and search for the first object that was created afterwards. It will only scan a single document, though it doesn't necessarily give you a uniform distribution.

var randRec = function() {
    // replace with your collection
    var coll = db.collection
    // get unixtime of first and last record
    var min = coll.find().sort({_id: 1}).limit(1)[0]._id.getTimestamp() - 0;
    var max = coll.find().sort({_id: -1}).limit(1)[0]._id.getTimestamp() - 0;

    // allow to pass additional query params
    return function(query) {
        if (typeof query === 'undefined') query = {}
        var randTime = Math.round(Math.random() * (max - min)) + min;
        var hexSeconds = Math.floor(randTime / 1000).toString(16);
        var id = ObjectId(hexSeconds + "0000000000000000");
        query._id = {$gte: id}
        return coll.find(query).limit(1)
    };
}();
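Usage might look like this (a sketch; the status filter is a made-up example):

// randRec(query) returns a cursor limited to one document.
var doc = randRec().next();
var activeDoc = randRec({ status: "A" }).next();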
Martin Nowak
  • It would be easily possible to skew the random date to account for superlinear database growth. – Martin Nowak Mar 31 '15 at 18:20
  • This is the best method for very large collections; it works in O(1), unlike the skip() or count() used in the other solutions here – marmor Nov 02 '16 at 09:04
4

My solution in PHP:

/**
 * Get random docs from Mongo
 * @param $collection
 * @param $where
 * @param $fields
 * @param $limit
 * @author happy-code
 * @url happy-code.com
 */
private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {

    // Total docs
    $count = $collection->find($where, $fields)->count();

    if (!$limit) {
        // Get all docs
        $limit = $count;
    }

    $data = array();
    for( $i = 0; $i < $limit; $i++ ) {

        // Skip documents
        $skip = rand(0, ($count-1) );
        if ($skip !== 0) {
            $doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
        } else {
            $doc = $collection->find($where, $fields)->limit(1)->getNext();
        }

        if (is_array($doc)) {
            // Catch document
            $data[ $doc['_id']->{'$id'} ] = $doc;
            // Ignore current document when making the next iteration
            $where['_id']['$nin'][] = $doc['_id'];
        }

        // Each iteration removes the chosen document from the pool, so decrease the total count
        $count--;

    }

    return $data;
}
4

The best way in Mongoose is to make an aggregation call with $sample. However, Mongoose does not hydrate aggregation results into Mongoose documents, especially not if populate() is to be applied as well.

For getting a "lean" array from the database:

/*
Sample model should be init first
const Sample = mongoose …
*/

const samples = await Sample.aggregate([
  { $match: {} },
  { $sample: { size: 33 } },
]).exec();
console.log(samples); //a lean Array

For getting an array of mongoose documents:

const samples = (
  await Sample.aggregate([
    { $match: {} },
    { $sample: { size: 27 } },
    { $project: { _id: 1 } },
  ]).exec()
).map(v => v._id);

const mongooseSamples = await Sample.find({ _id: { $in: samples } });

console.log(mongooseSamples); //an Array of mongoose documents
TG___
3

To get a determined number of random docs without duplicates:

  1. first get all ids
  2. get the number of documents
  3. loop, picking a random index and skipping duplicates

    number_of_docs = 7
    db.collection('preguntas').find({}, { _id: 1 }).toArray(function (err, arr) {
        count = arr.length
        idsram = []
        rans = []
        while (number_of_docs != 0) {
            var R = Math.floor(Math.random() * count);
            if (rans.indexOf(R) > -1) {
                continue
            } else {
                rans.push(R)
                idsram.push(arr[R]._id)
                number_of_docs--
            }
        }
        db.collection('preguntas').find({ _id: { $in: idsram } }).toArray(function (err1, doc1) {
            if (err1) { console.log(err1); return; }
            res.send(doc1)
        });
    });
anonymous255
Fabio Guerra
2

You can pick a random _id and return the corresponding object:

db.collection.count(function (err, count) {
    db.collection.distinct("_id", function (err, result) {
        if (err)
            res.send(err)
        var randomId = result[Math.floor(Math.random() * count)]
        db.collection.findOne({ _id: randomId }, function (err, result) {
            if (err)
                res.send(err)
            console.log(result)
        })
    })
})

Here you don't need to spend space on storing random numbers in the collection.

Vijay13
2

The following aggregation operation randomly selects 3 documents from the collection:

db.users.aggregate( [ { $sample: { size: 3 } } ] )

https://docs.mongodb.com/manual/reference/operator/aggregation/sample/

Anup Panwar
2

MongoDB now has $rand

To pick n non-repeating items, aggregate with { $addFields: { _f: { $rand: {} } } }, then $sort by _f and $limit n.
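A minimal shell sketch of that pipeline (the field name _f and size 5 are arbitrary choices; $rand requires MongoDB 4.4.2+):

// Attach a random value to each document, sort by it, and keep the first n.
db.mycoll.aggregate([
    { $addFields: { _f: { $rand: {} } } },
    { $sort: { _f: 1 } },
    { $limit: 5 },
    { $project: { _f: 0 } } // drop the helper field from the output
])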

Polv
2

My simplest solution to this ...

db.coll.find()
    .limit(1)
    .skip(Math.floor(Math.random() * 500))
    .next()

This assumes you have at least 500 items in the collection (the upper bound of the skip is hard-coded).

Irfan Habib
2

I would suggest using map/reduce, where you use the map function to emit only when a random value falls below a given probability.

function mapf() {
    if (Math.random() <= probability) {
        emit(1, this);
    }
}

function reducef(key,values) {
    return {"documents": values};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": { "probability": 0.5}});
printjson(res.results);

The reducef function above works because only one key ('1') is emitted from the map function.

The value of "probability" is defined in the "scope" when invoking mapReduce(...).

Using mapReduce like this should also be usable on a sharded db.

If you want to select exactly n of m documents from the db, you could do it like this:

function mapf() {
    if(countSubset == 0) return;
    var prob = countSubset / countTotal;
    if(Math.random() <= prob) {
        emit(1, {"documents": [this]}); 
        countSubset--;
    }
    countTotal--;
}

function reducef(key, values) {
    var newArray = new Array();
    for (var i = 0; i < values.length; i++) {
        newArray = newArray.concat(values[i].documents);
    }
    return {"documents": newArray};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
printjson(res.results);

Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.

This approach might give some problems on sharded databases.

torbenl
  • Doing a full collection scan to return 1 element... this must be the least efficient technique to do it. – Thomas Mar 29 '12 at 21:14
  • The trick is that it is a general solution for returning an arbitrary number of random elements - in which case it would be faster than the other solutions when getting > 2 random elements. – torbenl Feb 06 '14 at 10:52
1

When I was faced with a similar problem, I backtracked and found that the business request was actually for creating some form of rotation of the inventory being presented. In that case, there are much better options, which have answers from search engines like Solr, not data stores like MongoDB.

In short, with the requirement to "intelligently rotate" content, what we should do instead of picking a random number across all of the documents is to include a personal q-score modifier. To implement this yourself, assuming a small population of users, you can store a document per user that has the productId, impression count, click-through count, last-seen date, and whatever other factors the business finds meaningful for computing a q-score modifier. When retrieving the set to display, you typically request more documents from the data store than the end user asked for, then apply the q-score modifier, take the number of records requested by the end user, and then randomize that page of results. Since that is a tiny set, you can simply sort the documents in the application layer (in memory).

If the universe of users is too large, you can categorize users into behavior groups and index by behavior group rather than user.

If the universe of products is small enough, you can create an index per user.

I have found this technique to be much more efficient, but more importantly more effective in creating a relevant, worthwhile experience of using the software solution.
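A sketch of the kind of per-user document described above (all names and fields here are illustrative assumptions, not from the original answer):

// One stats document per user per product, feeding the q-score modifier.
db.user_product_stats.insert({
    userId: "u123",          // or a behavior-group key when the user base is large
    productId: "p456",
    impressions: 12,
    clickThroughs: 3,
    lastSeen: new Date()
})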

paegun
1

None of the solutions worked well for me, especially when there are many gaps and the set is small. This worked very well for me (in PHP):

$count = $collection->count($search);
$skip = mt_rand(0, $count - 1);
$result = $collection->find($search)->skip($skip)->limit(1)->getNext();
Mantas
  • You specify the language, but not the library you're using? – BenMorel Jan 21 '14 at 18:28
  • FYI, there is a race condition here if a document is removed between the first and third line. Also `find` + `skip` is pretty bad, you are returning all documents just to choose one :S. – Martin Konecny Jul 28 '14 at 03:33
  • find() should return only a cursor, so it wouldn't return all the actual documents. BUT yes, this compromise loses performance x 1000000 times in my test ;) – kakadais Oct 14 '21 at 00:16
1

I'd suggest adding a random int field to each object. Then you can just do a

findOne({random_field: {$gte: rand()}}) 

to pick a random document. Just make sure you ensureIndex({random_field:1})

om-nom-nom
mstearn
  • If the first record in your collection has a relatively high random_field value, won't it be returned almost all the time? – thehiatus Jan 23 '13 at 23:03
  • thehiatus is correct, it will -- it is not suitable for any purpose – Heptic Aug 07 '13 at 21:54
  • This solution is completely wrong. Adding a random number (let's imagine between 0 and 2^32-1) doesn't guarantee any good distribution, and using $gte makes it even worse, since your random selection won't be even close to a pseudo-random number. I suggest never using this concept. – Maximiliano Rios Dec 02 '13 at 20:32
1

My PHP/MongoDB sort/order by RANDOM solution. Hope this helps anyone.

Note: I have numeric ID's within my MongoDB collection that refer to a MySQL database record.

First I create an array with 10 randomly generated numbers

    $randomNumbers = [];
    for($i = 0; $i < 10; $i++){
        $randomNumbers[] = rand(0,1000);
    }

In my aggregation I use the $addFields pipeline stage combined with $arrayElemAt and $mod (modulus). The modulus operator gives me a number from 0 - 9, which I then use to pick a number from the array of randomly generated numbers.

    $aggregate[] = [
        '$addFields' => [
            'random_sort' => [ '$arrayElemAt' => [ $randomNumbers, [ '$mod' => [ '$my_numeric_mysql_id', 10 ] ] ] ],
        ],
    ];

After that you can use the $sort pipeline stage.

    $aggregate[] = [
        '$sort' => [
            'random_sort' => 1
        ]
    ];
feskr
0

If you have a simple id key, you could store all the id's in an array, and then pick a random id. (Ruby answer):

ids = @coll.find({},fields:{_id:1}).to_a
@coll.find(ids.sample).first
Mr. Demetrius Michael
0

Using Map/Reduce, you can certainly get a random record, just not necessarily very efficiently depending on the size of the resulting filtered collection you end up working with.

I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400ms on an Intel i3 with 16GB ram and a SATA3 HDD...

db.toc_content.mapReduce(
    /* map function */
    function() { emit( 1, this._id ); },

    /* reduce function */
    function(k,v) {
        var r = Math.floor((Math.random()*v.length));
        return v[r];
    },

    /* options */
    {
        out: { inline: 1 },
        /* Filter the collection to "A"ctive documents */
        query: { status: "A" }
    }
);

The Map function simply creates an array of the id's of all documents that match the query. In my case I tested this with approximately 30,000 out of the 50,000 possible documents.

The Reduce function simply picks a random integer between 0 and the number of items (-1) in the array, and then returns that _id from the array.

400ms sounds like a long time, and it really is; if you had fifty million records instead of fifty thousand, the overhead may increase to the point where it becomes unusable in multi-user situations.

There is an open issue for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533

If this "random" selection was built into an index-lookup instead of collecting ids into an array and then selecting one, this would help incredibly. (go vote it up!)

doublehelix
0

This works nicely. It's fast, works with multiple documents, and doesn't require pre-populating the rand field, which eventually populates itself:

  1. add index to .rand field on your collection
  2. use find and refresh, something like:
// Install packages:
//   npm install mongodb async
// Add index in mongo:
//   db.mycollection.ensureIndex({ rand: 1 })

var mongodb = require('mongodb')
var async = require('async')

// Find n random documents by using "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
  var result = []
  var rand = Math.random()

  // Append documents to the result based on criteria and options, if options.limit is 0 skip the call.
  var appender = function (criteria, options, done) {
    return function (done) {
      if (options.limit > 0) {
        collection.find(criteria, fields, options).toArray(
          function (err, docs) {
            if (!err && Array.isArray(docs)) {
              Array.prototype.push.apply(result, docs)
            }
            done(err)
          }
        )
      } else {
        async.nextTick(done)
      }
    }
  }

  async.series([

    // Fetch docs with uninitialized .rand.
    // NOTE: You can comment out this step if all docs already have .rand = Math.random() initialized.
    appender({ rand: { $exists: false } }, { limit: n - result.length }),

    // Fetch on one side of random number.
    appender({ rand: { $gte: rand } }, { sort: { rand: 1 }, limit: n - result.length }),

    // Continue fetch on the other side.
    appender({ rand: { $lt: rand } }, { sort: { rand: -1 }, limit: n - result.length }),

    // Refresh fetched docs, if any.
    function (done) {
      if (result.length > 0) {
        var batch = collection.initializeUnorderedBulkOp({ w: 0 })
        for (var i = 0; i < result.length; ++i) {
          batch.find({ _id: result[i]._id }).updateOne({ $set: { rand: Math.random() } })
        }
        batch.execute(done)
      } else {
        async.nextTick(done)
      }
    }

  ], function (err) {
    done(err, result)
  })
}

// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
  if (!err) {
    findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
      if (!err) {
        console.log(result)
      } else {
        console.error(err)
      }
      db.close()
    })
  } else {
    console.error(err)
  }
})

P.S. The question "How to find random records in MongoDB" is marked as a duplicate of this question. The difference is that this question asks explicitly about a single record, whereas the other one is explicitly about getting multiple random documents.

Mirek Rusin
0

For me, I wanted to get the same records in a random order, so I created an empty array used for sorting, then generated random numbers between one and 7 (I have seven fields). So each time I get a different value, I assign a different random sort. It is 'layman', but it worked for me.

// generate a random number
const randomval = some random value;

// declare the sort array and initialize it to empty
const sort = [];

// write a conditional to decide which sort to use
if (randomval == 1) {
    sort.push(...['createdAt', 1]);
} else if (randomval == 2) {
    sort.push(...['_id', 1]);
}
....
else if (randomval == n) {
    sort.push(...['n', 1]);
}
-2

If you're using mongoid, the document-to-object wrapper, you can do the following in Ruby. (Assuming your model is User)

User.all.to_a[rand(User.count)]

In my .irbrc, I have

def rando klass
    klass.all.to_a[rand(klass.count)]
end

so in rails console, I can do, for example,

rando User
rando Article

to get documents randomly from any collection.

Zack Xu
  • This is terribly inefficient as it will read the entire collection into an array and then pick one record. – JohnnyHK Dec 06 '13 at 13:25
  • Ok, maybe inefficient, but surely convenient. Try this if your data size isn't too big. – Zack Xu Dec 06 '13 at 15:16
  • Sure, but the original question was for a collection with 100 million docs, so this would be a very bad solution for that case! – JohnnyHK Dec 06 '13 at 15:25
-6

You can also use shuffle-array after executing your query:

var shuffle = require('shuffle-array');

Accounts.find(qry, function(err, results_array) {
    newIndexArr = shuffle(results_array);
});

rabie jegham
-8

What works efficiently and reliably is this:

Add a field called "random" to each document and assign a random value to it, add an index for the random field and proceed as follows:

Let's assume we have a collection of web links called "links" and we want a random link from it:

link = db.links.find().sort({random: 1}).limit(1)[0]

To ensure the same link won't pop up a second time, update its random field with a new random number:

db.links.update({_id: link._id}, {$set: {random: Math.random()}})
  • why *update* the database when you can just select a different random key? – Jason S Apr 08 '11 at 12:39
  • You may not have a list of the keys to select randomly from. – Mike Aug 21 '11 at 04:42
  • So you have to sort the whole collection each time? And what about the unlucky records that got large random numbers? They will never be selected. – Fantius Jan 11 '12 at 18:09
  • You have to do this because the other solutions, particularly the one suggested in the MongoDB book, don't work. If the first find fails, the second find always returns the item with the smallest random value. If you index random descendingly the first query always returns the item with the largest random number. – trainwreck Jan 17 '12 at 12:38
  • Adding a field in each document? I think it's not advisable. – CS_noob Jul 16 '16 at 17:48