ABSTRACT
Talking with some colleagues we came across the classic "extract a random row from a big database table" problem. The naive approach (seen all over SO too) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and only really usable on tables with very few rows: the whole table gets scanned and sorted just to keep a single row. There are approaches that achieve better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys, and the randomness gets skewed as soon as there are holes in your numeric primary keys (the little simulation below shows the effect). An answer to the last cited question links to this article, which has a good explanation and some clever solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again, if you have frequent DELETEs on a big table, you'll probably be screwed by the constant updating of the added table. Also note that many solutions rely on COUNT(*), which is ridiculously fast on MyISAM (the exact row count is stored) but only "just fast" on InnoDB, which has to scan an index to count (I don't know how other platforms behave, but I suspect the InnoDB case is representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
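To make the skew concrete, here is a tiny simulation (plain Python, no database involved; the key values and iteration count are made up for illustration) of the usual "pick a random number between MIN(id) and MAX(id), then take the first existing id at or above it" trick:

import random
from collections import Counter

# Pretend primary keys with a hole: ids 4..9 have been deleted.
keys = [1, 2, 3, 10]

picks = Counter()
for _ in range(100_000):
    r = random.randint(min(keys), max(keys))  # RAND() scaled between MIN(id) and MAX(id)
    picked = next(k for k in keys if k >= r)  # first row with id >= r
    picks[picked] += 1

print(picks)  # id 10 "absorbs" the whole hole: ~70% of the picks instead of a fair 25%

The row right after each gap inherits the whole gap's probability mass, so the bigger the holes, the less uniform the result.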
THE IDEA
A separate service could be responsible for generating, buffering and distributing random row ids, or even entire random rows:
- it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in RAM by the service (it shouldn't take too many bytes per key on top of the actual size of the PK, so it's probably fine up to 100~1000M rows on a standard PC and up to 1~10 billion rows on a beefy server)
- once the keys are in memory you have an implicit "row number" for each key, with no holes in it, so it's just a matter of choosing a random number and directly fetching the corresponding key (see the sketch right after this list)
- a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
- consumers of the service will connect and request N random rows from the buffer
- rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
- if the buffer is empty the request could either block or return an EOF-like marker
- if data is added to the master table the service must be signaled so it can add the same data to its copy, flush the buffer of random picks and go on from there
- if data is deleted from the master table the service must be signaled so it can remove that data from both the "all keys" list and the "random picks" buffer
- if data is updated in the master table the service must be signaled so it can update the corresponding entries in the key list and in the random picks
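Here is a minimal, single-process Python sketch of such a service. Everything in it is invented for illustration (class name, method names, buffer size); a real implementation would sit behind a network protocol, hold a pool of DB connections for fetching whole rows, and worry about concurrency, none of which is shown here:

import random
from collections import deque

class RandomKeyService:
    """Keeps every primary key of the master table in RAM.

    Picking a random row becomes "pick a random position in a list":
    no table scan, no ORDER BY RAND(), and non-numeric keys work too.
    A list of keys plus the position map costs a few tens of bytes per
    key in CPython, so hundreds of millions of keys fit in a few GB.
    """

    def __init__(self, keys, buffer_size=10_000):
        self._keys = list(keys)                  # implicit "row number" = list index, no holes
        self._pos = {k: i for i, k in enumerate(self._keys)}
        self._buffer_size = buffer_size
        self._buffer = deque()                   # pre-picked keys, ready to be served
        self._refill()

    def _refill(self):
        while self._keys and len(self._buffer) < self._buffer_size:
            self._buffer.append(random.choice(self._keys))

    def pick(self, n=1):
        """Return up to n random keys; fewer (or none) if the table is empty."""
        out = [self._buffer.popleft() for _ in range(min(n, len(self._buffer)))]
        self._refill()
        return out

    # --- signals sent by whoever writes to the master table ---
    def added(self, new_keys):
        for k in new_keys:
            if k not in self._pos:
                self._pos[k] = len(self._keys)
                self._keys.append(k)
        self._buffer.clear()                     # flush stale picks, then go on from there
        self._refill()

    def removed(self, dead_keys):
        for k in dead_keys:
            i = self._pos.pop(k, None)
            if i is None:
                continue
            last = self._keys.pop()              # O(1) removal: move the last key into the hole
            if last != k:
                self._keys[i] = last
                self._pos[last] = i
        self._buffer.clear()
        self._refill()

An "update" signal would just be removed([old_key]) followed by added([new_key]) when the key itself changes; since this sketch only mirrors keys, updates that leave the key untouched wouldn't need to be propagated at all (a variant that also caches whole rows would have to refresh them). Note that the keys don't even need to stay sorted: any stable position in the list gives a uniform pick, which is why removal can cheaply move the last key into the freed slot.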
WHY WE THINK IT'S COOL
- does not touch the disk, apart from the initial key load at startup or when it is signaled to reload
- works with any kind of primary key, numerical or not
- if you know you're going to update a large batch of data you can just signal it once when you're done (i.e. not at every single insert/update/delete on the original data); it's basically like having a fine-grained lock that only blocks requests for random rows (see the snippet right after this list)
- really fast on updates of any kind in the original data
- offloads some work from the relational db to a separate, memory-only process: helps scalability
- responds really fast from its buffers without waiting for any querying, scanning, sorting
- could easily be extended to similar use cases beyond the SQL one
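As a usage sketch of the "signal once per batch" point above (reusing the hypothetical RandomKeyService from the previous section):

service = RandomKeyService(["a1", "b2", "c3"])   # keys loaded once at startup; any key type works

imported_keys = ["d4", "e5", "f6"]               # stand-in for the keys a real bulk INSERT just wrote
service.added(imported_keys)                     # one signal for the whole batch, not one per row

print(service.pick(2))                           # e.g. ['f6', 'b2'], served straight from the buffer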
WHY WE THINK IT COULD BE A STUPID IDEA
- because we had the idea without help from any third party
- because nobody (that we know of) has ever bothered to do something similar
- because it adds complexity to the mix: the service has to be kept in sync whenever the original data changes
AND THE QUESTION IS...
Does anything similar already exist? If not, would it be feasible? And if not, why?