
I have a variation of an old MongoDB question: drawing a sample of 10 docs from a collection, subject to a search condition.

Starting point, I have this collection:

var CategoriesSchema = new Schema({
    referenceID: {type: Schema.Types.ObjectId, index: true},
    name: {type: String, index: true},
    ...
    counter: Number
});

There may be hundreds, potentially thousands, of categories in the collection (i.e. different "names"), each with a different referenceID that points to some other data.

I want to draw a random sample of 10 docs matching a certain name, e.g. 10 docs with the name "Technology". If there are only, say, 7 such docs, I want all of them; but if there are 400, I want a (pseudo)random sample of 10. Following the cookbook answer for random samples (Random record from MongoDB), I can add a random field (0 to 1) to the schema and use a compound index on "name" and "random" to get 10 random docs:

var CategoriesSchema = new Schema({
    referenceID: {type: Schema.Types.ObjectId, index: true},
    name: {type: String, index: true},
    // per-doc random value; the default must be the function
    // Math.random itself, not a single evaluated Math.random()
    random: {type: Number, default: Math.random},
    ...
    counter: Number
});

// compound index to support querying by name plus a random pivot
CategoriesSchema.index({name: 1, random: 1});

So, as explained in the referenced Stack Overflow answer, I have to combine an $lte and a $gte query on a random pivot to find my 10 (pseudo)random records. In particular, if there are fewer than 10 matching records and I only use $lte, I will not get all of them back every time (e.g. 2 might sit above the pivot and 5 below). So essentially I am always querying for 20 records (10 above and 10 below) even though I just want 10 - or I have to run an $lte query first and, if it does not yield 10, follow up with a second $gte query. On top of that, the cookbook answer comes with all the usual caveats of pseudo-random sampling (similar pivots returning "clusters" of the same docs, etc.).
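A minimal sketch of that follow-up-query variant, assuming a Mongoose model Categories compiled from the schema above (the model and function names are illustrative):

// Draw up to 10 docs at or below a random pivot, then top up from
// above the pivot if the first query came back short.
async function sampleTen(name) {
    var pivot = Math.random();
    // both queries can use the compound {name, random} index
    var docs = await Categories.find({name: name, random: {$lte: pivot}})
        .sort({random: -1})
        .limit(10);
    if (docs.length < 10) {
        var more = await Categories.find({name: name, random: {$gt: pivot}})
            .sort({random: 1})
            .limit(10 - docs.length);
        docs = docs.concat(more);
    }
    return docs;
}

Using $gt rather than $gte in the second query avoids returning a doc that sits exactly on the pivot twice.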

My question is whether anyone has a better solution that is still efficient for my parameters.

My analysis is:

  1. $sample doesn't work efficiently due to my requirement to query for name as a condition (?)
  2. skip with limit is not recommended due to scaling problems - if all my categories had at most, say, 100 docs, fine, I could use skip with limit to draw random docs, but what if a few categories have 10,000 docs? (see the sketch after this list)
  3. geospatial queries are not suitable due to the compound-index nature of my schema and requirements - i.e., as with point 1, I would need a geospatial query combined with a condition, which I understand does not work efficiently for large collections (?)
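For reference, the skip-with-limit variant I mean would look roughly like this (a sketch only; the model name and use of countDocuments are assumptions, and it returns a contiguous block rather than 10 independent draws):

// Count the matches, jump to a random offset, take 10 consecutive
// docs. The larger the offset, the more index entries MongoDB has
// to walk past, which is why this degrades on big categories.
async function sampleTenBySkip(name) {
    var total = await Categories.countDocuments({name: name});
    if (total <= 10) {
        return Categories.find({name: name});
    }
    var offset = Math.floor(Math.random() * (total - 10));
    return Categories.find({name: name}).skip(offset).limit(10);
}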

So should I just go with the cookbook answer - include the random field, add the compound index on name and random, and query with $lte and $gte for 10 docs each? Or has anyone implemented an alternative solution?

  • *"`$sample` does not meet my requirement for querying a name (sic).."* Yes it does. The whole point is to use a `$match` pipeline stage for your query conditions and then follow with `$sample` to return a "random sample" from the results that matched those conditions. So it seems you jumped to a conclusion without actually understanding the usage. Perhaps you should read again. – Neil Lunn Sep 04 '17 at 09:45
  • Thanks, Neil. I was under the impression I cannot efficiently combine a $sample query with a condition (akin to a compound query). At least in the mongo docs there is no such reference. Or in other words, if I have a collection with 1 million records and I want a sample of a 10000 record subset with a specific name, $sample will not be efficient to get me 10 records of these 10000 within the 1 million records. But maybe I misread your answer. – axelwittmann Sep 04 '17 at 09:49
  • Your impression is incorrect. `.aggregate([{ "$match": { ... } },{ "$sample": ... }])` is basically the general intended usage pattern. Plus it is an aggregation pipeline stage and can essentially be used wherever in the pipeline is most appropriate. But generally the most efficient usage will be to `$match` first using an index on conditions, and then randomly sample. – Neil Lunn Sep 04 '17 at 09:54
  • Thanks. I will read up on this then. I have to admit I still don't understand how the $match stage in your aggregate suggestion avoids first pulling all 10,000 matching records (which would be inefficient at scale) before drawing a sample - I would have thought a workable approach should pull just 10 records (or in my solution at most 20) from the overall collection via one index, without an intermediate stage that slices the collection and materializes 10,000 records (or potentially a million or more matching the "name" field) first. – axelwittmann Sep 04 '17 at 10:00
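For reference, the $match-then-$sample pattern described in the comments would look roughly like this (a sketch; the name value and sample size are taken from the question):

// $match runs first and can use the {name: 1} index; $sample then
// draws 10 random docs from the matched set (or all of them if
// fewer than 10 matched).
Categories.aggregate([
    {$match: {name: "Technology"}},
    {$sample: {size: 10}}
]).then(function (docs) {
    console.log(docs);
});

Note that $sample has caveats of its own - under some conditions it may return the same document more than once within a single sample.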
