-3

I've come across many of this same question here on StackOverflow. None providing a valid solid solution, so here we go:

I need to pick a random document from around 5 million documents in my MongoDB database in an efficient way.

I've tried getting the .count and using the .skip to get the random document, but it takes almost three seconds and very, very inefficient.

I can't make changes to the documents (like adding a "random") entry to each document or changing their _id's.

I've tried the solution of adding documents with an incremental _id (to pick a random _id to bypass using .skip) but this brought more headache than what it did when I try to add many documents in a short amount of time.

Adding data in an incremental way, or picking a random document, should not be this hard. I'm either missing some common knowledge, or doing something wrong, or this is what it really is..

Wanted to bring up the topic and get your responses.

Dev
  • 13,492
  • 19
  • 81
  • 174
Mia
  • 6,220
  • 12
  • 47
  • 81
  • 1
    possible duplicate of [Random record from MongoDB](http://stackoverflow.com/questions/2824157/random-record-from-mongodb) – Philipp Jun 26 '15 at 08:46
  • there is no solution in the link you provided, also some of the thread out of date. – Mia Jun 26 '15 at 08:48
  • post the sample document – karthick Jun 26 '15 at 08:51
  • 1
    @Zettam The question asks the exact same thing you are asking and has ten (10) upvoted answers. When it has no solution which is applicable to your case, then stackoverflow has none. When you think it's outdated, put a bounty on it. – Philipp Jun 26 '15 at 08:52
  • {"name":"John","content":"someLongStringHere","json":"someHueJsonObcHere"} – Mia Jun 26 '15 at 08:53

1 Answers1

1

Here is a way using the default ObjectId values for _id and a little math and logic.

// Get the "min" and "max" timestamp values from the _id in the collection and the 
// diff between.
// 4-bytes from a hex string is 8 characters

var min = parseInt(db.collection.find()
        .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    max = parseInt(db.collection.find()
        .sort({ "_id": -1 })limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
    diff = max - min;

// Get a random value from diff and divide/multiply be 1000 for The "_id" precision:
var random = Math.floor(Math.floor(Math.random(diff)*diff)/1000)*1000;

// work out a "random" _id value in the range:
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")

// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
   .sort({ "_id": 1 }).limit(1).toArray()[0];

That's the general logic in shell representation and easily adaptable.

So in points:

  • Find the min and max primary key values in the collection

  • Generate a random number that falls between the timestamps of those documents.

  • Add the random number to the minimum value and find the first document that is greater than or equal to that value.

This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value since that is what we are looking for. Using integers as the _id value is essentially simplier but the same basic idea in the points.

Blakes Seven
  • 49,422
  • 14
  • 129
  • 135
  • 1
    The idea is nice, but there's some issues: calling `toArray()` (in your min/max steps) without a `.limit()` will retrieve all 5 million records. I would also suggest using [`ObjectId#getTimestamp()`](http://mongodb.github.io/node-mongodb-native/2.0/api/ObjectID.html#getTimestamp) and [`ObjectId#createFromTime()`](http://mongodb.github.io/node-mongodb-native/2.0/api/ObjectID.html#.createFromTime) instead of using substrings. – robertklep Jun 26 '15 at 09:32
  • @robertklep OOps. Typo forgot the `.limit()` thanks for the catch. The shell example was given where those methods for timestamps are not available. But their implementation code is basically that, convert the hex string part to a number. But your basically correct in that it would be a cleaner coding approach. – Blakes Seven Jun 26 '15 at 09:35