Underlying issue: I have a table main
with nearly a million rows. I wish to perform some operation for every row in main
, and store the result in a results
table.
Due to the size of the tables involved pretty much any query referencing both tables is quite slow.
Due to the amount of processing involved, I need multiple clients processing data in parallel.
What I've done: So, to speed things I have a todo
table, with the same structure as main
. Each client process randomly selects a row from todo
, does work that takes ~30 seconds, and then removes the row from todo
.
The query to randomly select is:
SELECT todo.id, todo.tag
FROM todo JOIN
(SELECT
(SELECT MIN(id) FROM todo) + (RAND() *
(SELECT MAX(id) - MIN(id) FROM todo)
) AS val --All this to get a random number between min and max
) AS r2
WHERE todo.id >= r2.val
ORDER BY id ASC
LIMIT 1
See, e.g., MySQL select 10 random rows from 600K rows fast and http://jan.kneschke.de/projects/mysql/order-by-rand/.
The problem with what I've done: I'm getting a lot of "collisions" (that is, multiple entries of results for the same row in main
. In other words, multiple clients processing the same information nearly simultaneously. This is okay, in that the db can support it (many results for each main), and I can do cleanup later, but not okay in that I'm consuming processing time unnecessarily.
When I say "a lot", I mean on the order of 10% to 15% of new entries duplicate.
I largely suspect this is due to problems with MySQL's RAND()
function, as it has a history of problems and supposedly corrected bugs. (e.g., MySQL RAND() seed values almost repeat)
However, due to the random selection, it's inevitable that there are gaps in the todo
ids, and such gaps are only going to get larger ... thus the use of >=
(which is necessary to return a result when the randomly generated id is in a gap), means that as gaps grow, collisions are increasingly likely.
So, ... how can I parallelize this process while minimizing collisions (and doing very little extra work)? (Preferably using primarily SQL, without having to add intervening layers, such as a queue service.)