
I would like to load a random document out of a set of documents stored in a CouchDB database. The method for picking and loading the document should conform to the following requirements:

  • Efficiency: The lookup of the document should be efficient, most importantly the time to load the document must not grow linearly with the total number of documents. This means the skip query argument cannot be used.

  • Uniform distribution: The choice should be truly random (as far as possible, using standard random number generators), every document should have equal chances of being chosen.

What is the best way to implement this in CouchDB?

Christian Berg

5 Answers


After giving this some more thought, I came up with a solution. For completeness' sake, I will first show two simple approaches and explain why they are flawed. The third solution is the one I'm going with.

Approach 1: Skip

This is the trivial solution: You have a simple view (let's call it random) with a map function that emits all documents you want to choose from and the built-in _count reduce function. To pick a random document, follow these steps:

  • Find the total number of documents N in the view by calling:
    http://localhost:5984/db/_design/d/_view/random
  • Pick random number 0 <= i < N
  • Load the i'th document:
    http://localhost:5984/db/_design/d/_view/random?reduce=false&skip=i&limit=1
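
For reference, a minimal design document behind these URLs could look roughly like the sketch below. The answer doesn't spell out the map body, so emitting the document id is just an illustrative choice; any map that emits one row per candidate document works.

{
  "_id": "_design/d",
  "views": {
    "random": {
      "map": "function(doc) { emit(doc._id, null); }",
      "reduce": "_count"
    }
  }
}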

This approach is bad because it doesn't scale well for large numbers of documents. According to this section of "CouchDB - The Definitive Guide" the skip argument should only be used with single-digit values.

The solution above would have to loop through i documents before returning the chosen one. In SQL terms it's the equivalent of a full table scan as opposed to an index lookup.

Approach 2: Random Number in Document

With this approach, a random number is generated for each document at creation time and stored in the document. An example document:

{
  _id: "4f12782c39474fd0a498126c0400708c",
  rand: 0.4591819887660398,
  // actual data...
}

The random view then has the following map function:

function(doc) {
  // Only emit documents that have been assigned a random number.
  // A typeof check avoids skipping a doc whose rand happens to be exactly 0.
  if (typeof doc.rand === "number") {
    emit(doc.rand, doc);
  }
}

These are the steps to pick a random document (a small client-side sketch follows the list):

  • Pick a random number 0 <= r < 1
  • Load the document:
    http://localhost:5984/db/_design/d/_view/random?startkey=r&limit=1
  • If no document is returned (because r is larger than the largest random number stored in the database), wrap around and load the first document.
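
As a sketch only, the lookup could be done from client-side JavaScript roughly like this (assuming Node 18+ with the built-in fetch; the database and view names are taken from the URLs above, and error handling is left out):

// Approach 2: pick the first document with rand >= r, wrapping around if needed.
async function pickRandomDoc() {
  const base = "http://localhost:5984/db/_design/d/_view/random";
  const r = Math.random();

  // First document whose rand key is >= r
  let res = await fetch(`${base}?startkey=${r}&limit=1`);
  let rows = (await res.json()).rows;

  if (rows.length === 0) {
    // r is larger than every stored random number: wrap around to the first document
    res = await fetch(`${base}?limit=1`);
    rows = (await res.json()).rows;
  }
  return rows.length ? rows[0].value : null;
}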

This is very fast and looks great at first sight. However, there's a serious flaw: not all documents have the same chance of being picked.

In the most simple example, there are two documents in the database. When I choose a random document a very large number of times, I want each document to come up half of the time. Let's say the documents were assigned the random numbers 0.2 and 0.9 at creation time. So document A is picked when (r <= 0.2) or (r > 0.9) and document B is chosen when 0.2 < r <= 0.9. The chance of being picked is not 50% for each document, but 30% for A and 70% for B.

You might think the situation improves when there are more documents in the database, but it really doesn't. The intervals between documents get smaller, but the variation in interval size gets even worse: Imagine three documents A, B and C with the random numbers 0.30001057, 0.30002057 and 0.30002058 (no other documents are in between). The chances of B being chosen are 1000 times greater than those of C. In the worst case, two documents are assigned the same random number. Then only one of them can be found (the one with the lower document id), the other is essentially invisible.

Approach 3: A combination of 1 and 2

The solution I came up with combines the speed of approach 2 with the fairness of approach 1. Here it is:

As in approach 2, each document is assigned a random number at creation time, and the same map function is used for the view. As in approach 1, I also have a _count reduce function.

These are the steps for loading a random document:

  • Find the total number of documents N in the view by calling:
    http://localhost:5984/db/_design/d/_view/random
  • Pick random number 0 <= r < 1
  • Calculate random index: i = floor(r*N)
    My goal is to load the i'th document (as in approach 1). Assuming the distribution of random numbers is more or less uniform, I'm guessing the i'th document has a random value of approximately r.
  • Find the number of documents L with a random value lower than r: http://localhost:5984/db/_design/d/_view/random?endkey=r
  • See how far off our guess is: s = i - L
  • if (s>=0)
    http://localhost:5984/db/_design/d/_view/random?startkey=r&skip=s&limit=1&reduce=false
  • if (s<0)
    http://localhost:5984/db/_design/d/_view/random?startkey=r&skip=-(s+1)&limit=1&descending=true&reduce=false

So, the trick is to guess the random number assigned to the i'th document, look that up, see how far we're off and then skip the number of documents by which we missed.
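
Putting these steps together, a rough client-side sketch could look like this (again assuming Node 18+ with the built-in fetch; the empty-database case and error handling are left out):

// Approach 3: guess the key of the i'th document, then correct by skipping s rows.
async function pickRandomDoc() {
  const base = "http://localhost:5984/db/_design/d/_view/random";

  // Total number of documents N, via the _count reduce
  let res = await fetch(base);
  const N = (await res.json()).rows[0].value;

  const r = Math.random();
  const i = Math.floor(r * N); // index of the document we want to load

  // Number of documents L with a random value below r (reduced count up to endkey)
  res = await fetch(`${base}?endkey=${r}`);
  const reduced = (await res.json()).rows;
  const L = reduced.length ? reduced[0].value : 0;

  const s = i - L; // how far off the guess was
  const query =
    s >= 0
      ? `?startkey=${r}&skip=${s}&limit=1&reduce=false`
      : `?startkey=${r}&skip=${-(s + 1)}&limit=1&descending=true&reduce=false`;

  res = await fetch(base + query);
  const rows = (await res.json()).rows;
  return rows.length ? rows[0].value : null;
}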

The number of documents skipped should remain small even for large databases, since the accuracy of the guess will increase with the number of documents. My guess is that s remains constant as the database grows, but I haven't tried it and I don't feel qualified to prove it theoretically.

If you have a better solution, I'd be very interested!

Christian Berg
  • Very interesting! What if you used #2, but each time you select a document, you change its random number? – Nick Perkins Aug 18 '11 at 18:15
  • Interesting idea. It still wouldn't be fair (the probability distribution for a single choice isn't uniform), but the effects might be cancelled out over many choices, so it could be good enough in practice. You'd have to compare the performance hit of storing a new version after every lookup with my approach #3. – Christian Berg Sep 07 '11 at 23:07
  • Have you considered instead of storing a random number with a doc just generating it in a view? I.e. something like: `function(doc) { emit(Math.random(), null) }` – Nikita Volkov Jan 30 '13 at 19:08
  • @NikitaVolkov The problem with that idea seems to be that CouchDB caches the query after running it. So even if you're running an ad-hoc dynamic query, if you ever send the same query again, it will return the same ordering of results. However, if any characters are different (I threw in an arbitrary variable assignment to a random string), your returned results will be random. – Paragon May 12 '13 at 06:04
  • I think #2 could work, if we tweak it just a little. Instead of using `r` as the `startkey`, we use it as a `startkey` or `endkey` based on a random coin flip on every query. I think it passes the 0.2 and 0.9 example. – Meliodas Apr 03 '17 at 11:20
  • @Meliodas I don't think this is sufficient in the general case with many documents. See my comment on the answer by Robert Sirois below. – Christian Berg Mar 01 '18 at 10:33

If insert performance is not an issue, you could try to make the number non-random, e.g. make it doc_count + 1 at the time of creation. Then you could look it up with a random number 0 <= r < doc_count. But that would require either synchronizing the creation of documents or keeping a sequence external to CouchDB, e.g. in an SQL database.

Best regards

Felix

  • That's a possibility, but as you say synchronizing document creation can become an issue - especially if you want to cluster across several CouchDB instances. Even more work needs to be done when documents are deleted. All in all, I think my "Approach 3" is more feasible. Still, +1 for your input. Thanks! – Christian Berg Sep 27 '10 at 09:51

How about "abusing" a view's reduce function?

function (keys, values, rereduce) {
  // Keep one randomly chosen value at both the reduce and the rereduce stage,
  // so that querying the view with reduce returns a single "random" value.
  return values[Math.floor(Math.random() * values.length)];
}
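
Wired into a design document, that could look roughly like the sketch below (the map body is an illustrative assumption; it emits the document id as the value, so the reduce result is an id you could then fetch directly). Keep in mind that CouchDB stores reduce results in the view index, so the picked value only changes when the view is updated, and reducing to large values (such as whole documents) can run into the reduce output size check.

{
  "_id": "_design/d",
  "views": {
    "random": {
      "map": "function(doc) { emit(doc._id, doc._id); }",
      "reduce": "function(keys, values, rereduce) { return values[Math.floor(Math.random() * values.length)]; }"
    }
  }
}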
Shawn

I agree with @meliodas:

Here's the distribution of option 2 (n=1000):

{ 0.2: 233,
  0.9: 767 }

And with swapping startkey/endkey half the time:

{ 0.2: 572,
  0.9: 428 }

Not sure what happens to the distribution when you look at more data, but it initially seems a bit more promising. This is without using option 1 at all, which I don't think is necessary.
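
For what it's worth, this kind of test can be reproduced without CouchDB at all; the following is my own small in-memory simulation of the coin-flip variant (keys 0.2 and 0.9, n = 1000), not the code used for the numbers above:

// Simulate option 2 with startkey/endkey flipped by a coin toss on every pick.
function pick(keys) {
  const sorted = [...keys].sort((a, b) => a - b);
  const r = Math.random();
  if (Math.random() < 0.5) {
    // ascending query: first key >= r, wrapping around to the smallest key
    return sorted.find((k) => k >= r) ?? sorted[0];
  }
  // descending query: first key <= r, wrapping around to the largest key
  return [...sorted].reverse().find((k) => k <= r) ?? sorted[sorted.length - 1];
}

const counts = {};
for (let n = 0; n < 1000; n++) {
  const key = pick([0.2, 0.9]);
  counts[key] = (counts[key] || 0) + 1;
}
console.log(counts); // roughly 500 each in expectation for this two-document case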

  • I think this only works in this trivial case with two documents. If you have three documents with random values 0.499, 0.5, 0.501, the one in the middle will have a very low chance of being picked, no matter from which side you come. Would you mind trying it and sharing the outcome? – Christian Berg Mar 01 '18 at 10:24
  • I imagine you're right with that particular test set. I think the point of the discussion was that without a fair algorithm implemented on the database everything else is a hack and will have deficiencies. I'd be happy to run the results though when I get a chance later this week. – Robert Sirois Mar 02 '18 at 10:45
  • @ChristianBerg This method is supposed to work when you're setting the `rand` property in all three docs randomly. In a random distribution, there's almost zero chance of getting three values as close as 0.499, 0.5, 0.501. I think a random distribution should place all the points in an equidistant manner, which should suffice. Please correct me if I'm wrong. – Meliodas Mar 03 '18 at 06:50

Approach 2b: Sequential Number in Document

This approach is similar to Approach 2 mentioned in this answer. That approach uses random numbers twice (once in the document itself and once in the process of picking a document). This Approach 2b uses a random number only in the picking process and stores sequential integers in the documents. Note that it will not work if documents are deleted (see below). Here is how it works:

Add sequential integers to your documents at creation time:

{
    _id: "4f12782c39474fd0a498126c0400708c",
    int_id : 0,
    // actual data...
}

and another document:

{
    _id: "a498126c0400708c4f12782c39474fd0",
    int_id : 1,
    // actual data...
}

and just count up by one with each document.

The view random uses a similar map function, keyed on int_id (although you might want to change the view's name to something other than "random"):

function(doc) {
  // A typeof check is needed here: int_id starts at 0, which is falsy,
  // so "if (doc.int_id)" would silently skip the first document.
  if (typeof doc.int_id === "number") {
    emit(doc.int_id, doc);
  }
}

These are the steps for loading a random document:

  • Find the total number of documents N in the view by calling:
    http://localhost:5984/db/_design/d/_view/random
  • Pick random number 0 <= r < 1
  • Calculate random index: i = floor(r*N)
  • Load the document:
    http://localhost:5984/db/_design/d/_view/random?startkey=i&limit=1

This way we get an even distribution of int_id values from 0 to N-1 by design. Then we pick a random index (between 0 and N-1) and use it to look up a document on that even distribution.
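
A rough client-side sketch of these steps (assuming Node 18+ with the built-in fetch, and assuming the view also has a _count reduce so that the bare view URL returns N, as in the other approaches):

// Approach 2b: direct lookup by sequential int_id.
async function pickRandomDoc() {
  const base = "http://localhost:5984/db/_design/d/_view/random";

  // Total number of documents N, via the _count reduce
  let res = await fetch(base);
  const N = (await res.json()).rows[0].value;

  // Uniformly random index 0 <= i < N, used directly as the view key
  const i = Math.floor(Math.random() * N);
  res = await fetch(`${base}?startkey=${i}&limit=1&reduce=false`);
  const rows = (await res.json()).rows;
  return rows.length ? rows[0].value : null;
}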

Note

This approach no longer works when documents in the middle or at the beginning are deleted: the int_id values have to run from 0 up to N-1 without gaps.

Mue