I have a lottery-type system with random picks that I am trying to optimize.

I have the following constraints:

  • I need to apply the SELECT...FOR UPDATE only to rows whose deal_id is the current deal of my app (i.e. I don't apply it to the WHOLE table/ALL the rows of the table, only to those where, for example, deal_id = 3)
  • I need to select only rows where available = true
  • I need to output only 1 row: when a player buys a ticket I must go check these 1 million rows and RANDOMLY choose one for him (only one, so many Stack Overflow solutions like here or TABLESAMPLE do not easily work)
  • I usually have about 1 million rows that match deal_id = 3 (3 as an example) and available = true (out of a total of about 30M rows at any given time)
  • I have very high READS and WRITES => about 50 to 100+ concurrent reads on the table and, as a consequence, approximately the same number of writes (once chosen, available is changed from true to false inside the SELECT...FOR UPDATE)
  • A row is locked while the select/update on it is in progress. For now I'm using SELECT...FOR UPDATE with pg_try_advisory_xact_lock (and when PostgreSQL 9.5 goes out of beta, I will use SKIP LOCKED)
  • I need blazing-fast speed: I target a query time < 5 ms
  • Regarding IDs: there can be huge gaps between IDs in the table as a whole, BUT inside the tickets of a specific deal (see query below) there is no gap at all between IDs (not even the smallest), which I presume can matter for finding the most appropriate query

Here is my current query. It is an ARBITRARY PICK, but now I want to change/recreate it into a RANDOM PICK (while avoiding the usual ORDER BY random() LIMIT 1, which needs to go through all 1M rows and is very slow, and maybe also avoiding OFFSET, as it is notoriously slow on large datasets).

UPDATE tickets s
        SET available = false
        FROM (
              SELECT id
              FROM   tickets
              WHERE  deal_id = #{@deal.id}
              AND    available
              AND    pg_try_advisory_xact_lock(id)
              LIMIT  1
              FOR    UPDATE
              ) sub
        WHERE         s.id = sub.id
        RETURNING     s.name, s.id

How do I change this query to move from an arbitrary pick to a RANDOM pick, with the fastest query possible?

If possible, I'd like tangible query suggestions that I can try in my app.

Mathieu
  • Is it really necessary to ask (basically) the same question three times? http://stackoverflow.com/q/33330915/2235885 – joop Oct 27 '15 at 10:14
  • Well, I am aware I failed to give complete details about the constraints, so I ended up with suggestions that didn't meet my needs. My bad. That's why here I give all the necessary info. – Mathieu Oct 27 '15 at 12:21
  • Maybe just add `SKIP LOCKED` in the subquery? – Mikko Rantalainen May 20 '21 at 14:46

1 Answer


Regarding IDs: there can be huge gaps between IDs in the table as a whole, BUT inside the tickets of a specific deal (see query below) there is no gap at all between IDs (not even the smallest), which I presume can matter for finding the most appropriate query.

This makes your life much easier. I'd use the following approach.

0) Create index on (deal_id, available, id).
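A minimal DDL sketch for that index (the index name is just illustrative):

CREATE INDEX tickets_deal_available_id_idx
    ON tickets (deal_id, available, id);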

1) Get MIN and MAX values of ID for the given deal_id.

SELECT MIN(id) AS MinID, MAX(id) AS MaxID
FROM   tickets
WHERE  deal_id = #{@deal.id}
AND    available

If this query results in a full index scan instead of two quick lookups at the ends of the index, use two separate queries for MIN and MAX.
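One possible split, reusing the same predicates (each query can then be answered by a single probe at one end of the index):

SELECT MIN(id) AS MinID
FROM   tickets
WHERE  deal_id = #{@deal.id}
AND    available;

SELECT MAX(id) AS MaxID
FROM   tickets
WHERE  deal_id = #{@deal.id}
AND    available;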

2) Generate a random integer number RandID in the found range [MinID; MaxID].
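One way to generate it in SQL; MinID and MaxID below are placeholders for the values found in step 1 (you could just as well generate the number in application code):

-- MinID and MaxID are placeholders for the values obtained in step 1
SELECT floor(random() * (MaxID - MinID + 1))::int + MinID AS RandID;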

3) Pick the row with ID = RandID. The query should be a fast index lookup.

UPDATE tickets s
    SET available = false
    FROM (
          SELECT id
          FROM   tickets
          WHERE  deal_id = #{@deal.id}
          AND    available
          AND    id = @RandID
          AND    pg_try_advisory_xact_lock(id)
          LIMIT  1
          FOR    UPDATE
          ) sub
    WHERE         s.id = sub.id
    RETURNING     s.name, s.id

If there are concurrent processes that can add or delete rows, consider raising the transaction isolation level to SERIALIZABLE.
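A sketch of how that could wrap the pick query (standard PostgreSQL syntax; serializable transactions can fail with serialization errors, so be prepared to retry):

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- run the UPDATE ... RETURNING query from step 3 here
-- retry the whole transaction on SQLSTATE 40001 (serialization_failure)
COMMIT;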


Having said all this, I realised that it won't work. When you say that IDs don't have gaps, you most likely mean that there are no gaps for IDs with the same deal_id (regardless of the value of the available column), but not among IDs that have the same deal_id AND available = true.

As soon as the first random row is set to available = false, there will be a gap in the IDs.


Second attempt

Add a float column RandomNumber to the tickets table that holds a random number in the range (0,1). Whenever you add a row to this table, generate such a random number and save it in this column.

Create index on (deal_id, available, RandomNumber).
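A minimal DDL sketch for the column and index (names are illustrative; note that unquoted identifiers fold to lowercase in PostgreSQL, which is fine as long as the queries don't quote them either):

ALTER TABLE tickets ADD COLUMN RandomNumber double precision;

-- backfill existing rows, then let new rows get a value automatically
UPDATE tickets SET RandomNumber = random();
ALTER TABLE tickets ALTER COLUMN RandomNumber SET DEFAULT random();

CREATE INDEX tickets_deal_available_rand_idx
    ON tickets (deal_id, available, RandomNumber);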

Order by this RandomNumber to pick a random row that is still available. The query should be a fast index lookup.

UPDATE tickets s
    SET available = false
    FROM (
          SELECT id
          FROM   tickets
          WHERE  deal_id = #{@deal.id}
          AND    available
          AND    pg_try_advisory_xact_lock(id)
          ORDER BY RandomNumber
          LIMIT  1
          FOR    UPDATE
          ) sub
    WHERE         s.id = sub.id
    RETURNING     s.name, s.id
Vladimir Baranov
  • Note: this is about the same solution as in the answer to the OP's first question. – joop Oct 27 '15 at 12:41
  • Thanks a lot for this suggestion. About your first attempt, you are totally right: there is no hole when you take tickets within a specific deal_id, but there will for sure be holes when you take tickets with a specific deal AND available = true. – Mathieu Oct 27 '15 at 20:52
  • I'm quite new to PostgreSQL, so there is something I don't totally get: what is the advantage of creating a new column (random()) and putting an index on it VERSUS querying the id column (primary, and could have an index too)? Both are numbers, both have "holes"/gaps between values... why do we really need to create a new column vs using the existing id column? Is it because it is much FASTER to ORDER float values between 0 and 1 (created by random()) than to ORDER integer values between 1 and 1 million? – Mathieu Oct 27 '15 at 20:53
  • 1
    Column `RandomNumber` is used to pick a random row. The random numbers are precalculated and you don't need to generate million random numbers each time you want to pick only one row. It takes the same time to sort float values vs integer values, but when you have only integer IDs there is no way to pick a **random** row. There is no sense in sorting integer IDs. – Vladimir Baranov Oct 27 '15 at 22:25