
This procedure is the most frequently executed in the target application. Assume concurrent operations; t.value is always changing.

-- Table is MySQL InnoDB
-- let's call this MainSelect
SELECT t.Id
FROM table t
WHERE t.A = conditionA AND t.B = conditionB AND t.value > 0
ORDER BY RAND()
LIMIT 1 INTO vIndex FOR UPDATE;

-- IF vIndex IS NOT NULL THEN
UPDATE table SET value = value - 1 WHERE id = vIndex;
-- END IF;

The goal is to modify MainSelect to use a form of the following method of random row selection, for its speed; this is the main question of this post. The method is included here for completeness:

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1
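
For concreteness, a naive combination of MainSelect and the method above might look like the sketch below. This is an assumption of what the combined form would be, not a working solution; it inherits the method's known non-uniformity when Id values have gaps, and the WHERE clause guarantees gaps among the matching rows.

-- Sketch only: MainSelect rewritten in the style of the method above.
-- Rows whose Id follows a gap in the matching set are selected more often.
SELECT t.Id
  FROM `table` AS t JOIN
       (SELECT (RAND() * (SELECT MAX(Id) FROM `table`)) AS Id) AS r
 WHERE t.Id >= r.Id
   AND t.A = conditionA AND t.B = conditionB AND t.value > 0
 ORDER BY t.Id ASC
 LIMIT 1 INTO vIndex FOR UPDATE;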

Discussion:

How would the total number of rows matched by MainSelect be determined?
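
Presumably with a separate count over the same predicate, something like the sketch below (vCount is a hypothetical variable); under concurrent writes the count is stale the moment it is read.

-- Sketch: count the candidate rows with the same predicate as MainSelect.
SELECT COUNT(*)
  FROM `table` t
 WHERE t.A = conditionA AND t.B = conditionB AND t.value > 0
  INTO vCount;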

If the answer is to make MainSelect a subquery, moving the FOR UPDATE to the outermost query, the randomly selected row's t.value could become 0 before the outer SELECT locks the row. Something like:

SELECT * FROM (firstquery) s ...random selection logic.. FOR UPDATE;

If this concern is accurate, the question becomes which transaction isolation level should be set at the outset.
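
One alternative to raising the isolation level might be to re-verify t.value in the UPDATE itself and retry when no row is affected (a sketch only; this is not the current procedure):

-- Sketch (hypothetical optimistic variant): decrement only if the row
-- still qualifies. ROW_COUNT() = 0 means value reached 0 between the
-- selection and the update, so the caller re-runs MainSelect.
UPDATE `table`
   SET value = value - 1
 WHERE id = vIndex
   AND value > 0;

-- IF ROW_COUNT() = 0 THEN
--   ... re-run MainSelect and try again ...
-- END IF;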

Thanks

Edit - Notes while working:

  1. Maybe reservoir sampling (http://en.wikipedia.org/wiki/Reservoir_sampling), since the count is unknown. I'd like to avoid high isolation levels, as I would expect lower throughput as a result.

  2. Perhaps a random number could be stored and indexed rather than computed. A random number is then selected, and according to the LIMIT documentation (quoted below), selecting one row this way is very fast; see the sketch after the quote. The problem with this is that the result set will not be uniform.

If you use LIMIT row_count with ORDER BY, MySQL ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast.
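
A sketch of note 2, assuming a hypothetical indexed rnd column populated with RAND() at insert time (the column and index names are mine):

-- Hypothetical schema change for note 2:
-- ALTER TABLE `table` ADD COLUMN rnd DOUBLE NOT NULL, ADD INDEX idx_rnd (rnd);
-- Each row stores rnd = RAND() when inserted.

SET @r := RAND();
SELECT t.Id
  FROM `table` t
 WHERE t.rnd >= @r
   AND t.A = conditionA AND t.B = conditionB AND t.value > 0
 ORDER BY t.rnd ASC
 LIMIT 1;
-- If no row matches (@r is above every stored rnd), a second pass with
-- t.rnd < @r ORDER BY t.rnd DESC would be needed. Rows that follow large
-- gaps in rnd are chosen more often, which is the non-uniformity noted above.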

  • How many rows does the table have and what percentage match the `t.A = conditionA AND t.B = conditionB AND t.value > 0` condition? – ypercubeᵀᴹ Oct 12 '13 at 18:15
  • The number of rows is not constant, and the percentage match is unknown as the match criteria and t.value are always changing. However, the row count could be fixed to only change n rows at a single fixed time per day if this is useful. Also, t.A, and t.B are fixed and will not change for the lifetime of the row. – fionbio Oct 12 '13 at 18:28
  • I asked for an estimate, if table has thousands, millions or billions of rows? – ypercubeᵀᴹ Oct 12 '13 at 18:30
  • 1 million or less. :) .. and now you're going to tell me I'm optimizing prematurely, or that `order by RAND()` is performant for this many rows. – fionbio Oct 12 '13 at 18:33
  • No, I'm not at all saying that, only trying to figure the situation more accurately. `ORDER BY RAND()` is not going to be fast with a million matching rows. – ypercubeᵀᴹ Oct 12 '13 at 18:35
  • I guess I wasn't very accurate in my first reply. In practice, the criteria matches for t.A & t.B is expected to bring the result count to <= 1% of the total row count. Still interested in a solution if you have ideas. Thanks. – fionbio Oct 12 '13 at 18:43
