
ABSTRACT

Talking with some colleagues we came across the "extract a random row from a big database table" issue. It's a classic one, and we know the naive approach (also on SO) is usually something like:

SELECT * FROM mytable ORDER BY RAND() LIMIT 1

THE PROBLEM

We also know a query like that is utterly inefficient and actually usable only with very few rows. There are some approaches that could be taken to attain better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys, and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article, which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again, if you have frequent DELETEs on a big table, you'll probably be screwed by the constant updating of that extra table. Also note that many solutions rely on COUNT(*), which is ridiculously fast on MyISAM but "just fast" on InnoDB (I don't know how it performs on other platforms, but I suspect the InnoDB case is representative of other transactional database systems).
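
To make the discussion concrete, here is a rough sketch of the COUNT(*)-based variant those answers tend to suggest, using sqlite3 from the standard library purely to keep it self-contained; a MySQL query would have the same shape, and mytable and its columns are invented for the demo.

    import random
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO mytable (id, payload) VALUES (?, ?)",
                     [(i, "row %d" % i) for i in range(1, 1001)])

    def random_row_offset(conn):
        # Two simple queries instead of a full ORDER BY RAND() sort; note that
        # COUNT(*) is not free on transactional engines such as InnoDB, and
        # OFFSET still has to walk past `offset` rows.
        (total,) = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()
        offset = random.randrange(total)
        return conn.execute(
            "SELECT * FROM mytable LIMIT 1 OFFSET ?", (offset,)
        ).fetchone()

    print(random_row_offset(conn))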

In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.

THE IDEA

A separate service could be responsible for generating, buffering and distributing random row ids, or even entire random rows (a rough sketch in code follows the list):

  • it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in RAM by the service (it shouldn't take too many bytes per row in addition to the actual size of the PK, so it's probably OK up to 100~1000M rows on standard PCs and up to 1~10 billion rows on a beefy server)
  • once the keys are in memory you have an implicit "row number" for each key, with no holes in it, so it's just a matter of choosing a random number and directly fetching the corresponding key
  • a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
  • consumers of the service will connect and request N random rows from the buffer
  • rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
  • if the buffer is empty the request could block, or return an EOF-like marker
  • if data is added to the master table the service must be signaled, so it can add the same data to its copy, flush the buffer of random picks and go on from there
  • if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
  • if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
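
A minimal, in-process sketch of what such a service could look like (all names are invented; a real service would sit behind a socket/RPC layer, hold a pool of DB connections to return entire rows, and could treat an update signal as a delete plus an insert):

    import bisect
    import random
    from collections import deque

    class RandomKeyService:
        def __init__(self, initial_keys, buffer_size=1024):
            self.keys = sorted(initial_keys)   # implicit "row number" = list index
            self.buffer_size = buffer_size
            self.buffer = deque()              # pre-picked random keys
            self._refill()

        def _refill(self):
            while self.keys and len(self.buffer) < self.buffer_size:
                self.buffer.append(random.choice(self.keys))

        def get(self, n=1):
            # Consumers request N random keys; an empty buffer behaves like EOF
            # here, a real service could block instead.
            out = []
            while n > 0 and self.buffer:
                out.append(self.buffer.popleft())
                n -= 1
            self._refill()
            return out

        # Signals from the master table: keep the key list in sync and drop
        # buffered picks that may no longer be valid.
        def on_insert(self, key):
            bisect.insort(self.keys, key)
            self.buffer.clear()
            self._refill()

        def on_delete(self, key):
            i = bisect.bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key:
                del self.keys[i]
            self.buffer = deque(k for k in self.buffer if k != key)
            self._refill()

    svc = RandomKeyService(range(1, 1_000_001))   # e.g. one million integer PKs
    print(svc.get(3))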

WHY WE THINK IT'S COOL

  • does not touch disks other than the initial load of keys at startup or when signaled to do so
  • works with any kind of primary key, numerical or not
  • if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data); it's basically like having a fine-grained lock that only blocks requests for random rows
  • really fast on updates of any kind in the original data
  • offloads some work from the relational DB to another, memory-only process: helps scalability
  • responds really fast from its buffers without waiting for any querying, scanning, sorting
  • could easily be extended to similar use cases beyond the SQL one

WHY WE THINK IT COULD BE A STUPID IDEA

  • because we had the idea without help from any third party
  • because nobody (we heard of) has ever bothered to do something similar
  • because it adds complexity in the mix to keep it updated whenever original data changes

AND THE QUESTION IS...

Does anything similar already exist? If not, would it be feasible? If not, why?

Luke404
2 Answers


The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date when the original data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.

How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger fires when the row change is made, not when the transaction commits, so it can notify the cache of a change that is later rolled back. This is a general problem with notifying external systems from triggers.

If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.

In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.

For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
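
Roughly, and only as an illustration (table/column names and the `conn` object with its sqlite3-style execute() shortcut are assumptions), the inequality lookup looks like this:

    import random

    def random_row_inequality(conn):
        # Ids right after a gap are picked more often -- the weakness above.
        lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM mytable").fetchone()
        pick = random.randint(lo, hi)
        return conn.execute(
            "SELECT * FROM mytable WHERE id >= ? ORDER BY id LIMIT 1", (pick,)
        ).fetchone()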

Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
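
A sketch of that cache, where the 30-value / 60-second figures come from the paragraph above and everything else (names, the `conn` object) is invented:

    import random
    import time

    _cache = []
    _cache_time = 0.0

    def cached_random_id(conn, size=30, ttl=60):
        global _cache, _cache_time
        if not _cache or time.monotonic() - _cache_time > ttl:
            # Pay the expensive random selection once per ttl, not per request.
            # RANDOM() is the SQLite/PostgreSQL spelling; MySQL would use RAND().
            rows = conn.execute(
                "SELECT id FROM mytable ORDER BY RANDOM() LIMIT %d" % int(size)
            ).fetchall()
            _cache = [r[0] for r in rows]
            _cache_time = time.monotonic()
        return random.choice(_cache)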

Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
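
And a sketch of the equality-plus-retry loop (again, all names and the `conn` object are assumptions):

    import random

    def random_row_equality(conn, max_tries=10):
        # Guess an id uniformly between MIN and MAX, look it up by equality,
        # retry on a miss (a gap in the key space), and give up after a few
        # tries so the caller can fall back to another method.
        lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM mytable").fetchone()
        for _ in range(max_tries):
            row = conn.execute(
                "SELECT * FROM mytable WHERE id = ?", (random.randint(lo, hi),)
            ).fetchone()
            if row is not None:
                return row
        return None  # caller switches to another method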

Bill Karwin

It appears you are basically addressing a performance issue here. Most DB performance experts recommend having as much RAM as your DB size; then disk is no longer a bottleneck, because your DB lives in RAM and flushes to disk as required.

You're basically proposing a custom-developed, in-RAM CDC (change data capture) hashing system.

You could just build this as a standard database-only application and lock your mapping table in RAM, if your DB supports that.
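
For example, assuming MySQL (the question's examples), the closest in-database approximation is probably a mapping ("equal distribution") table kept in the MEMORY storage engine, which also comes up in the comments below. Everything here (table/column names, the `cur` DB-API cursor from a driver such as mysql-connector with %s placeholders) is invented for illustration:

    import random

    DDL = """
    CREATE TABLE mytable_map (
        rownum INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        id     INT UNSIGNED NOT NULL
    ) ENGINE=MEMORY
    """
    LOAD = "INSERT INTO mytable_map (id) SELECT id FROM mytable"
    PICK = """
    SELECT t.* FROM mytable_map m JOIN mytable t ON t.id = m.id
    WHERE m.rownum = %s
    """

    def random_row_via_map(cur):
        # rownum stays dense only as long as the map is rebuilt (or patched)
        # whenever the master table changes -- the same maintenance problem
        # discussed in the other answer.
        cur.execute("SELECT COUNT(*) FROM mytable_map")
        (n,) = cur.fetchone()
        cur.execute(PICK, (random.randint(1, n),))
        return cur.fetchone()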

I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.

Nick.Mc
  • Good advice, but FWIW there is no feature to lock a table in RAM in MySQL. – Bill Karwin Nov 07 '12 at 03:37
  • My apologies, I missed the MySQL part. But if you have enough RAM then it would effectively be locked in anyway. What I'm getting at is you probably shouldn't worry too much about the RAM caching aspects, as databases do that themselves automatically. Further, if you force the usage of RAM for this purpose, that might come at the expense of the database performing poorly due to lack of RAM... with a net loss of performance! – Nick.Mc Nov 07 '12 at 05:07
  • @BillKarwin, what about the MEMORY storage engine? – Luke404 Nov 07 '12 at 10:35
  • @ElectricLlama you didn't miss it. I made a couple of examples on MySQL (re: COUNT(*)) but also cited "other platforms". My question was deliberately RDBMS-agnostic, as long as we're talking about SQL DBs. I like your idea of implementing the same functionality inside the DB, but I also think doing it outside helps a little with scalability (you're moving some memory and processing requirements somewhere else, possibly onto a dedicated system). – Luke404 Nov 07 '12 at 10:43