How to get random documents, which are evenly distributed, from Cloud Firestore?

Question

The linked question does not provide an answer to this one, so please stop closing this as a duplicate: Firestore: How to get random documents in a collection

If there were only two ids, fff...ffe and fff...fff, the first one would get picked practically every time, even with that descending order.

Original post:

I have a collection users, where ids are generated with Python's uuid.uuid4(). I'd like to pick a random id from the collection. Huge bonus if it can be securely and "perfectly evenly" gathered like secrets.choice(), but that isn't completely necessary.

Below is the code I'm basically using now. It works reasonably well when there are lots of documents in the database.

from uuid import uuid4
from google.cloud import firestore_v1 as firestore

client = firestore.Client()

def get_random_user_id():
    """Try to find a random user id."""
    search_id = str(uuid4())
    print('searching from: {}'.format(search_id))
    query = client.collection('users').where(
        firestore.field_path.FieldPath.document_id(),
        '>=',
        client.document('users/' + search_id)
    ).limit(1)
    docs = query.stream()
    for doc in docs:
        return doc.id
    # Maybe there aren't that many documents, just get the first document
    docs = client.collection('users').limit(1).stream()
    for doc in docs:
        return doc.id
    # No documents found
    return False

print(get_random_user_id())

But as you can imagine, if there aren't that many documents or the documents have ids that are almost right next to each other, the chances for them to be picked are quite different.

Let's take an extreme example. If there were only two ids, fff...ffe and fff...fff, the first one would get picked practically every time.
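To make the skew concrete, here is a small simulation. It is not Firestore code: it only mimics the `>=` lookup with the "first document" fallback on a sorted list of ids. The 16-bit id space and the helper names `pick_like_query` and `simulate` are assumptions made for illustration.

```python
import bisect
import random
from collections import Counter

ID_SPACE = 2 ** 16  # assumption: a tiny id space standing in for 128-bit UUIDs


def pick_like_query(sorted_ids, search_id):
    """Return the first id >= search_id, or the smallest id if there is none."""
    pos = bisect.bisect_left(sorted_ids, search_id)
    if pos == len(sorted_ids):
        return sorted_ids[0]  # mirrors the fallback limit(1) query
    return sorted_ids[pos]


def simulate(ids, trials=10_000, seed=0):
    """Count how often each id is returned for uniformly random search ids."""
    rng = random.Random(seed)
    sorted_ids = sorted(ids)
    counts = Counter()
    for _ in range(trials):
        counts[pick_like_query(sorted_ids, rng.randrange(ID_SPACE))] += 1
    return counts


# Two adjacent ids at the very top of the id space, as in the extreme example.
counts = simulate([ID_SPACE - 2, ID_SPACE - 1])
print(counts)
```

In this model each id is returned with a probability proportional to the gap between it and the previous id, so two neighbouring ids at the top of the range end up with almost all picks going to the lower one.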

So, is there a proper way to pick random documents evenly without maintaining a list of all documents or some other hacky workaround?

Would you like to edit the question to indicate *why* the extensive answer, written by a Google employee who works on Firestore, doesn't answer the question? Perhaps the answer is actually that the only reasonable options for picking a random document are already listed there. — Doug Stevenson, Apr 16 '20 at 03:39
Maybe if you had read my question in the first place, instead of immediately closing it, you would know. — oittaa, Apr 16 '20 at 03:41
If your collection size is small (two documents), then just read both documents and flip a coin? — Doug Stevenson, Apr 16 '20 at 03:42
It's not even about two. Let's say there are aaaa, bbbbb, ccccc. The first one would be hugely overrepresented and bbbbb underrepresented. — oittaa, Apr 16 '20 at 03:43
It sounds like you need actual entropy on your document IDs, and you need to have a number of documents that makes reading them all infeasible. Firestore is meant to operate at scale, and document IDs are typically highly randomized. It might be that your use case simply isn't fit for Firestore. — Doug Stevenson, Apr 16 '20 at 03:44
If you had read my question, you would know there is plenty of entropy in uuid4. It just doesn't guarantee that a huge number of keys will be evenly distributed. — oittaa, Apr 16 '20 at 03:46
High entropy, by definition, would give you exactly that for a large number of documents. If you have a small number of documents, just read them all and pick one randomly out of memory. — Doug Stevenson, Apr 16 '20 at 03:47
The proper way to pick a document at random is to ensure there is a field with enough entropy and a uniform distribution, and then apply the approaches outlined in the question you linked. If your sample size is too small to guarantee an even distribution, then as Doug said, consider also giving them another uniformly distributed field, such as an incremental index (which is pretty much what you'd do with Doug's selection when you load them all into memory and randomly select them by array index). — Frank van Puffelen, Apr 16 '20 at 04:08
Thank you @Frank van Puffelen. I just wanted to know whether there was a simple way to pick a random document with a distribution somewhat comparable to `secrets.choice`, as the official Firestore documentation is a bit unclear in that regard. — oittaa, Apr 16 '20 at 04:21

score 0 · Answer 1 · answered Apr 16 '20 at 07:48

As discussed in the comments on the question (thanks to @DougStevenson and @FrankvanPuffelen for the clarification), the best way to pick random documents from a big collection is to give each document a field with enough entropy and a uniform distribution, and then apply the approaches outlined in this other question.
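A minimal sketch of that idea, assuming each user document stores a random integer in a field called `random` (a hypothetical name, not part of the original schema) and that `client` is a Firestore client like the one in the question. It uses the same positional `where` style as the question's code.

```python
import secrets

RANDOM_FIELD = 'random'  # hypothetical field name, assumed to exist on every document


def new_random_value():
    """Value to store in RANDOM_FIELD when a user document is created."""
    return secrets.randbits(63)  # non-negative, fits in a signed 64-bit integer


def get_random_user_id_by_field(client):
    """Return the id of the first document at or after a random point."""
    users = client.collection('users')
    start = new_random_value()
    query = users.where(RANDOM_FIELD, '>=', start).order_by(RANDOM_FIELD).limit(1)
    for doc in query.stream():
        return doc.id
    # Nothing at or after the random point: wrap around to the smallest value.
    for doc in users.order_by(RANDOM_FIELD).limit(1).stream():
        return doc.id
    return None  # the collection is empty
```

This evens out only as the number of documents grows. With a handful of documents the gaps between the stored random values still decide how often each one is picked, which is the effect described in the question.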

For a small collection, a better way is to use some kind of incremental index, which is effectively what happens when you load all the documents into an array and pick a random array index.
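For the small-collection case, a sketch along these lines might do, assuming `client` is a Firestore client like the one in the question and that reading every document in `users` is acceptable. The function name is made up for illustration.

```python
import secrets


def get_random_user_id_small(client):
    """Read every user id, then pick one uniformly at random."""
    ids = [doc.id for doc in client.collection('users').stream()]
    if not ids:
        return None  # the collection is empty
    # secrets.choice picks uniformly using the OS's secure random source.
    return secrets.choice(ids)
```

Every id has the same chance of being picked regardless of how the ids are spaced, at the price of reading the whole collection on each call.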
