Parallelization in MySQL, and problems with `RAND()`

Question

Underlying issue: I have a table main with nearly a million rows. I wish to perform some operation for every row in main, and store the result in a results table.

Due to the size of the tables involved pretty much any query referencing both tables is quite slow.

Due to the amount of processing involved, I need multiple clients processing data in parallel.

What I've done: So, to speed things I have a todo table, with the same structure as main. Each client process randomly selects a row from todo, does work that takes ~30 seconds, and then removes the row from todo.

The query to randomly select is:

SELECT todo.id, todo.tag
FROM todo JOIN
    (SELECT
        (SELECT MIN(id) FROM todo) + (RAND() *
           (SELECT MAX(id) - MIN(id) FROM todo)
        ) AS val   --All this to get a random number between min and max
    ) AS r2
WHERE todo.id >= r2.val
ORDER BY id ASC
LIMIT 1

See, e.g., MySQL select 10 random rows from 600K rows fast and http://jan.kneschke.de/projects/mysql/order-by-rand/.

The problem with what I've done: I'm getting a lot of "collisions" (that is, multiple entries of results for the same row in main. In other words, multiple clients processing the same information nearly simultaneously. This is okay, in that the db can support it (many results for each main), and I can do cleanup later, but not okay in that I'm consuming processing time unnecessarily.

When I say "a lot", I mean on the order of 10% to 15% of new entries duplicate.

I largely suspect this is due to problems with MySQL's RAND() function, as it has a history of problems and supposedly corrected bugs. (e.g., MySQL RAND() seed values almost repeat)

However, due to the random selection, it's inevitable that there are gaps in the todo ids, and such gaps are only going to get larger ... thus the use of >= (which is necessary to return a result when the randomly generated id is in a gap), means that as gaps grow, collisions are increasingly likely.

So, ... how can I parallelize this process while minimizing collisions (and doing very little extra work)? (Preferably using primarily SQL, without having to add intervening layers, such as a queue service.)

Have you tried using a `MOD` capability? for example, having 4 parallel processes using `WHERE id MOD 4 = 0`, `WHERE id MOD 4 = 1`, etc.? it is fixed number of processes, but u can do as many as ud like without collision. — PlantTheIdea, Mar 25 '15 at 17:58
@PlantThelda The number of client processes running varies, so `MOD` isn't an easy answer. — alttag, Mar 25 '15 at 18:04
Alright then, next question ... why does the row selection need to be random? Can you not establish a queue and process as FIFO? — PlantTheIdea, Mar 25 '15 at 18:11
As explained above, I'd like to avoid, if possible, writing a queue service layer. (Yes, it's pretty trivial to do, but still consumes network resources and has consequent latency of each check.) Random was a half-measure, but as I'm finding, perhaps not the best one. — alttag, Mar 25 '15 at 18:20

Parallelization in MySQL, and problems with `RAND()`

0 Answers0