
The question of how to get a random document from a collection has been asked many times, and there have been various suggestions on the topic.

What I need is to get several random documents from a collection, and what is worse, those documents must match certain criteria (i.e. be filtered). For example, I have a collection of articles where each article has a 'topic' field. The user chooses a topic he's interested in, and my db must show the corresponding articles in random order each time.

Obviously, the previously discussed hacks won't help me. The only way I see to achieve what I want is to query for the corresponding topic, getting ids only:

var arr = db.articles.find({topic: 3}, {_id:1}).toArray();

and then generate a random sequence of numbers based on how many documents were received, use those numbers as indexes into the array to pick document ids, and finally make another request to MongoDB to obtain the documents with those randomly chosen ids.

As you can see, this is a little slow altogether, especially if the first query returns many articles :)
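The two-round-trip approach above can be sketched in plain JavaScript. The only non-trivial part is picking k distinct random positions from the n ids returned by the first query; the collection and field names are the ones from the question, and the shell queries in the comments are untested sketches:

```javascript
// Pick k distinct random indexes in [0, n). This is the client-side
// half of the two-query approach described above.
function sampleDistinctIndexes(n, k) {
  if (k > n) throw new RangeError("cannot sample " + k + " of " + n);
  const picked = new Set();
  while (picked.size < k) {
    picked.add(Math.floor(Math.random() * n)); // uniform in [0, n)
  }
  return [...picked];
}

// First round trip, ids only (shell syntax, untested):
//   var arr = db.articles.find({topic: 3}, {_id: 1}).toArray();
// Second round trip, using the sampled positions:
//   var ids = sampleDistinctIndexes(arr.length, 7).map(i => arr[i]._id);
//   var docs = db.articles.find({_id: {$in: ids}}).toArray();
```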

So what I think is that there may be some MongoDB command to get documents by their position in an index. The point is that I can create a covered compound index like this:

db.articles.ensureIndex({topic: 1, _id:1});

Now my query would only have to scan the contiguous run of matching _ids in the index. And if I could request documents from the collection by those _id positions, I could do the whole thing in one request! Something like:

var cursor = db.articles.find({topic:3, $indexKeyPosition: {$in: myRandomSequence}});

Does anyone know about such features?

Andrey Kon

4 Answers


Nowadays, you should be able to use the $sample aggregation stage.

Example (untested):

db.articles.aggregate([
    { $match : { topic : 3 } },
    { $sample : { size: 3 } }
])

Note, however, that – if using the older MMAPv1 storage engine – it may return the same document more than once.
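If those MMAPv1 duplicates matter, one option is to over-sample slightly (e.g. `size: 5` when you want 3) and de-duplicate by _id on the client. A minimal sketch; `docs` stands in for the array returned by the aggregation:

```javascript
// Drop repeated _ids, keeping the first occurrence, and stop once we
// have `limit` distinct documents.
function dedupeById(docs, limit) {
  const seen = new Set();
  const out = [];
  for (const doc of docs) {
    const key = String(doc._id);
    if (!seen.has(key)) {
      seen.add(key);
      out.push(doc);
      if (out.length === limit) break; // enough distinct docs
    }
  }
  return out;
}
```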

Vince Bowdren
Claudiu
  • `$sample` works on MongoDB 3.4.9, returning distinct documents. Sample won't return more documents than your collection contains. – Stijn Van Bael Oct 13 '17 at 12:47
  • @StijnVanBael The [docs](https://docs.mongodb.com/master/reference/operator/aggregation/sample/#pipe._S_sample) say "**WARNING: $sample may output the same document more than once in its result set.**" – Vince Bowdren Dec 04 '17 at 10:32
  • If all the following conditions are met, `$sample` uses a pseudo-random cursor to select documents: `$sample` is the first stage of the pipeline; N is less than 5% of the total documents in the collection; the collection contains more than 100 documents. – Dharmendra Yadav Aug 16 '19 at 08:43
  • @VinceBowdren This warning applied only to the MMAPv1 storage engine, which has been removed in MongoDB v4.2 and later. In the docs for v4.0 they clarified that the duplication may occur only on MMAPv1: https://www.mongodb.com/docs/v4.0/reference/operator/aggregation/sample/ – pepkin88 May 23 '23 at 13:19
  • @pepkin88 thanks. I've amended the answer to clarify. – Vince Bowdren May 23 '23 at 14:32

> So what I think is that there may be some mongodb command to get documents by index keys based on their position in the index. The point is that I can create covered compound index like this:

No, there isn't such a function in MongoDB, though it would be a good idea to be able to randomise a result set. In the meantime, here is the relevant JIRA ticket: https://jira.mongodb.org/browse/SERVER-533

Since there is no way to pick documents by index position (which would allow a single index-backed round trip), you have no choice but to open multiple cursors.

The current solution depends on how many documents are in your result set.

If there are a small number of documents in your result set, you can solve this with a simple skip(rand()) and limit(1); however, you must be aware that skip() makes no effective use of indexes.

That does not mean it will scan the entire B-tree; it means it will scan as far as you skip().

This means that if your result set grows large and the rand() becomes a large number, you will see serious performance problems, as many have.

One good way of possibly solving this is to maintain on each document either:

  • an auto-incrementing id (kept in sync yourself, e.g. via a counters collection), or
  • a pre-computed random field with a value between 0 and 1.

And use that new field to "skip" on, together with the rest of your query, like:

var arr = db.articles.find({topic: 3, rand: {$gte: rand()}}, {_id: 1}).sort({rand: 1}).limit(7).toArray();

This will get 7 rows using the 0-to-1 idea (note the $gte: an exact-equality match on a stored random float would almost never hit).

The random-sort abilities of this rely on an ever-changing data set to help create randomness within the sort. This will, of course, not work if the result set is continuously static.

As for using batchSize, it becomes irrelevant here, and in fact it usually is. For example, your logic of using batchSize to get all your results back does not quite hold, since a batch has an absolute maximum size of 16MB. This means that if your documents are large, you may not be getting the single round trip you think you are.

This also only dictates that the server will send all this data at once, it does not denote the amount of work placed on the server, just the amount of data sent over the wire at once.

So, considering that you must do this with multiple cursors (the way I would recommend), you can just run:

var arr = db.articles.find({topic: 3, rand: {$gte:rand()}}).sort({rand:1}).limit(1);

Several times over, or however many times you need. This is not much different from a normal iteration of a cursor, and provided you have the right indexes it should be quite fast.
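The repeated pick can be simulated in memory: each document carries a precomputed `rand` in [0, 1), and one pick is "first document with rand >= r in rand order". A sketch, including the usual wrap-around fallback for when r overshoots every stored value (without it, the query above can return nothing):

```javascript
// One random pick against an in-memory stand-in for the indexed
// {rand: 1} sort order. `docs` is the filtered result set, `r` is the
// freshly drawn random number.
function pickByRand(docs, r) {
  const sorted = [...docs].sort((a, b) => a.rand - b.rand); // sort({rand: 1})
  // find({rand: {$gte: r}}).limit(1), wrapping to the smallest rand
  // if r is larger than every stored value.
  return sorted.find(d => d.rand >= r) ?? sorted[0];
}
```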

There is one other method, but I do not recommend it. You could run a map-reduce job, say, once every hour, which regenerates another collection of _id and rand() pairs. This means you could do the first query I showed:

var arr = db.articles.find({topic: 3, rand: {$gte: rand()}}, {_id: 1}).sort({rand: 1}).limit(7).toArray();

And really get 7 random records, since each document's rand will, of course, be different after each regeneration. But this is not real-time, nor would it be very kind to your server on a large data set, so I do not recommend it.

Edit

There is one other way. With an auto-incrementing id, you could use an $or statement to pick out 7 rand()s all at once. However, this introduces yet another problem: deletes.

If you delete any rows, you might hit a rand() that doesn't exist, and so no row will be returned. Since the auto-incrementing id is not adjusted for deletes server-side, you would have to do this yourself, which would not be easy or scalable.

To add to this, $or statements cannot be limit()ed per clause, which means you cannot get around this by writing sub-select-style $or clauses that make MongoDB pick out only one result per clause using $gte.

The same applies to a rand() between 0 and 1: it would work with an $or if you could limit individual clauses.

Sammaye
  • ehh, I don't quite understand how your query can return several docs when you query by a rand field which has one fixed value per document... – Andrey Kon Dec 17 '12 at 09:04
  • It depends upon the random order you require. If you have a set of _ids already, then the random pick does not matter, so you only require random order, and that query satisfies the need. If you need random picking, then you can just use the query however many times you need, picking one out at a time for now. BTW, here is a JIRA for your kind of functionality: https://jira.mongodb.org/browse/SERVER-533?focusedCommentId=15107&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel – Sammaye Dec 17 '12 at 09:09
  • No, this doesn't seem to be the right solution at all. Anyway, thanks for trying to help – Andrey Kon Dec 17 '12 at 09:14
  • What's wrong with the solution? You can just run the random-picker query seven times? I mean, it is not much different from iterating a normal cursor there. – Sammaye Dec 17 '12 at 09:20
  • The only other alternative I have found is to use an MR to update a random sort collection to be used as a lookup, I would not advise this at all... – Sammaye Dec 17 '12 at 09:25
  • Several round-trips to the db are never a good idea. What you suggested is essentially very close to my own solution, but mine required two round-trips while yours requires even more :) What I'm searching for is the possibility to do everything in one request. – Andrey Kon Dec 17 '12 at 09:33
  • No, each time you query a cursor you are making a round trip (hence why embedding means only one round trip to get a document), so actually you are running the query more times than you think; MongoDB doesn't have "result sets" like SQL techs. – Sammaye Dec 17 '12 at 09:36
  • You're not right:) I can set batchSize parameter to 20 and I will get 20 docs within one round-trip. – Andrey Kon Dec 17 '12 at 09:39
  • BatchSize would only help if the JIRA ticket were implemented, though of course it requires that your documents fit into the max batch size (normally 4MB). It was unclear whether that would help you or not. I mean, what if you want more than can fit into the 4MB space, i.e. you have one document that is 10MB big? Plus, batch size doesn't limit the number of run queries; however, you are right in thinking it means only one trip to the server itself. – Sammaye Dec 17 '12 at 09:54
  • That should divulge most information on the subject to date, I edited from the one I wrote when I had just woken up this morning – Sammaye Dec 17 '12 at 17:28

You can (as in pagination) count how many documents match the query, then make N queries with skip(random_value) and limit(1).

db.collection.count({field:value,field2:value2})

db.collection.find({field:value,field2:value2}).skip(n).limit(1)

If the collection is indexed for the query, it should be fast.
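An in-memory sketch of this count-then-skip approach; `matching` stands in for the filtered result set. Note that each of the N picks costs one skip on the server, and skip() walks past every skipped entry, which is why this degrades on large result sets:

```javascript
// Simulate N independent skip(random)/limit(1) picks against the
// already-counted result set.
function randomViaSkip(matching, n) {
  const picks = [];
  for (let i = 0; i < n; i++) {
    const offset = Math.floor(Math.random() * matching.length); // skip(offset)
    picks.push(matching[offset]);                               // limit(1)
  }
  return picks; // may contain repeats, as with independent queries
}
```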

francadaval
  • No, skip and limit traverse the whole tree. It has been discussed many times before that this approach is inefficient... Besides, you suggest making several requests, which doesn't sound fast... – Andrey Kon Dec 17 '12 at 09:05

Starting in MongoDB 4.4.2, the $rand aggregation operator is available. You can generate a random number, save it as a new field, and then sort by that randomly generated value, like so:

db.collection.aggregate([
    {
        $match: filterQuery,
    },
    {
        $addFields: {
            tmpOrder: { '$rand': {} },
        },
    },
    {
        $sort: {
            tmpOrder: 1,
        },
    },
]);
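The in-memory equivalent of this pipeline is attaching a random `tmpOrder` to each matched document and sorting by it, which yields a shuffle of the whole result set with no repeats (unlike $sample on MMAPv1). A sketch:

```javascript
// Mirror of the pipeline above: $addFields a random tmpOrder, then
// $sort by it ascending.
function shuffleLikeRand(docs) {
  return docs
    .map(d => ({ ...d, tmpOrder: Math.random() })) // $addFields + $rand
    .sort((a, b) => a.tmpOrder - b.tmpOrder);      // $sort on tmpOrder
}
```

Adding a `$limit` stage after the `$sort` would make this a without-replacement alternative to $sample.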
Tom Slabbaert