
I have a Ruby on Rails 4 app using a PostgreSQL query that basically looks into an Opportunity table, searches randomly among the rows for those where 'available = true', and updates the selected row with 'available = false'. Each time a user clicks on 'try an opportunity', the app runs the query below.

UPDATE opportunities s
SET    opportunity_available = false
FROM  (
   SELECT id
   FROM   opportunities
   WHERE  deal_id = #{@deal.id}
   AND    opportunity_available
   AND    pg_try_advisory_xact_lock(id)
   LIMIT  1
   FOR    UPDATE
   ) sub
WHERE     s.id = sub.id
RETURNING s.prize_id, s.id;
-- inspired by https://stackoverflow.com/questions/33128531/put-pg-try-advisory-xact-lock-in-a-nested-subquery

I've been struggling for 5 days, but I have now managed to roughly understand how it's (mis)behaving; now I need to know how to change it.

Indeed, the app "goes a little crazy" and selects (and updates) exactly the rows I updated most recently.

Let me be clear about my process for creating these opportunities:

  • As admin, I create 20 opportunities (rows) via my app.

  • Then, in my admin panel, I create a prize with a quantity, let's say prize id = 45 with quantity = 4. The app goes into the Opportunity table and randomly fills (this part works perfectly) 4 rows with prize_id = 45.

Now a user runs the app and, as explained above, the app does not choose a row randomly but always starts with the most recently updated ones: it takes one of the rows with a prize, then another with a prize, then another, and never those without prizes (with prize_id empty)...

Now I made some manual experiments: I manually changed values in my table (via pgAdmin), and the weirdest thing happened: if I modify row 32, then row 45, then row 67, when the user tries to play again, guess what, the rows that are picked "randomly" are exactly the ones I updated, in reverse order: it picks row 67, then row 45, then row 32. It won't even consider the other rows that could be chosen (all the other opportunities where available = true).

I also tried removing FOR UPDATE and the pg_try_advisory_xact_lock(id) line, and it still shows the same issue.

Since, as admin, I first create the full 20 rows and THEN the 4 winning rows, the winning rows are the last to be updated (even if on my pgAdmin screen they stay in the same places... maybe in the background PostgreSQL orders them as the last updated, consecutively, and selects them?), and that's roughly why, as soon as one of these 4 winning rows is selected, all the others follow.

To be clear, I would be OK with the app going row by row to select each available opportunity (e.g. row 3, then row 4, then row 5, in the order I see in pgAdmin, since the prizes are already distributed over the rows totally randomly). The problem is that it's not doing this: it often takes all the winning rows consecutively...

I'm speechless and have no clue how to break this pattern.

Note: this pattern does not happen 100% of the time, but very often. For example, if I have more than 4 winning rows, then for something like 2 minutes, if I keep clicking as a user, it behaves as described here; then it stops and seems (or I might be wrong) to behave normally; then after a minute it again only selects winning rows...

EDIT 1

Here is how prizes are injected into the Opportunity table (example when I create a prize whose id is 21 and whose quantity is 3): it distributes them randomly (as far as I understand), but only where there is not already a prize_id (i.e. if an opportunity has prize_id empty, it can put one there):

  SQL (2.4ms)  
  UPDATE "opportunities"
  SET "prize_id" = 21
  WHERE "opportunities"."id" 
  IN (
    SELECT "opportunities"."id"
    FROM "opportunities"
    WHERE (deal_id = 341 AND prize_id IS NULL)
    ORDER BY RANDOM()
    LIMIT 3)
   (0.9ms)  COMMIT

This SQL query is generated by a Rails gem called Randumb (github.com/spilliton/randumb).

EDIT 2

I moved to another, more precise question: Postgresql 9.4 - FASTEST query to select and update on large dataset (>30M rows) with heavy writes/reads and locks

Indeed, I think the problem is really that I need a truly RANDOM pick and must move away from an ARBITRARY pick.

Erwin already said this in Advisory locks or NOWAIT to avoid waiting for locked rows?, but now I understand what he meant ("'Arbitrary' also matters because it implies that Postgres will typically pick the same row for the same query arbitrarily, which makes lock contention a much bigger issue than truly random picks would."): PostgreSQL is free to choose the fastest way to produce the LIMIT 1 row, and it keeps picking the same rows, namely the ones that were updated last. But I can't live with that, since the last-updated rows are all winning opportunities.
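
For illustration, here is what a truly random variant of my query at the top could look like (just a sketch on my side, not something I have load-tested): the only change is an ORDER BY random() inside the subquery, which forces a real shuffle but has to read and sort every available row of the deal, exactly the cost I'm worried about with 1 million rows.

UPDATE opportunities s
SET    opportunity_available = false
FROM  (
   SELECT id
   FROM   opportunities
   WHERE  deal_id = #{@deal.id}
   AND    opportunity_available
   AND    pg_try_advisory_xact_lock(id)
   ORDER BY random()  -- the only change: shuffle the candidates before LIMIT
   LIMIT  1
   FOR    UPDATE
   ) sub
WHERE     s.id = sub.id
RETURNING s.prize_id, s.id;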

  • "... postgresql is ordering them as the last updated consecutively and selecting them" - in postgresql, an update is implemented by adding a new version of the row in the table and marking the old version as obsolete, so this is pretty much correct – araqnid Oct 25 '15 at 15:03
  • So I had the right intuition. Now, since I create/send the prizes into the Opportunity table at the END of my admin process (so creating a new version of those rows, as you say, probably the latest on PostgreSQL's list => likely to be taken by the LIMIT 1), how can I still manage to get a RANDOM pick (or even, as said in the question, the order I see the rows in pgAdmin) and not have the latest (all winning) rows shown consecutively? – Mathieu Oct 25 '15 at 15:05
  • it surprises me that there's no randomness being added into your inner query- e.g. an `order by random() limit 1`- it also surprises me that it seems to be taking the row from the *end* of the table, though. – araqnid Oct 25 '15 at 15:06
  • The prizes are ALREADY sent randomly. For example, I'm describing what I can SEE in the pgAdmin interface: I create 20 opportunities = 20 rows, from line 1 to 20, then I go into my admin panel and create a prize with quantity 4: it sends them randomly, e.g. to lines 6, 9, 13 and 17 (at least that's what I see in pgAdmin) – Mathieu Oct 25 '15 at 15:08
  • but maybe this "random" feeling, in the background, is not really random – Mathieu Oct 25 '15 at 15:10
  • I'm going to post how prizes are 'randomly injected' into Opportunities. But for sure I first create the 20 opportunities, THEN I send the 4 winning rows: they're the last rows inside Opportunities to be updated :) – Mathieu Oct 25 '15 at 15:11
  • I think the query against the table needs to "shuffle" the rows, otherwise it will read them in a deterministic (if obscure) order, e.g. the order they occur inside the table file. Could you try, as an experiment, creating the opportunities all in one batch with some already marked as winning? e.g. prepare them in a temp table and copy them in? – araqnid Oct 25 '15 at 15:12
  • I have added the PostgreSQL query I see in my terminal when I create a prize and it is injected into the Opportunities table – Mathieu Oct 25 '15 at 15:18
  • I agree when you say "I think the query against the table needs to "shuffle" the rows, otherwise it will read them in a deterministic (if obscure) order. e.g. the order they occur inside the table file. Could you try, as an experiment, creating the opportunities all in one batch with some already marked as winning? e.g. prepared them in a temp table and copy them in?" That's why I was using the Randumb Rails gem that sends the 3 prizes 'randomly' among the 20 opportunities (see my example) – Mathieu Oct 25 '15 at 15:20
  • "Could you try, as an experiment, creating the opportunities all in one batch with some already marked as winning": that's smart. Let me check: I could tell the app to put prize_id = 5 on all the opportunities it creates, but it would be hard to put prizes only on certain rows; it would be on all rows. My process today is: create the 20 opportunities (inside a deal), then create a prize with quantity = 5, for example, and inject them using Randumb into the 20 already created opportunity rows. – Mathieu Oct 25 '15 at 15:24
  • "Could you try, as an experiment, creating the opportunities all in one batch with some already marked as winning": quite hard for me and definitely not trivial, I'd need to change my app a lot – Mathieu Oct 25 '15 at 15:24
  • When I create the 20 lines of opportunities, the admin (me) has not yet created the prizes, so I can't already tell the app "hey, put these prizes in these opportunities". I first need to go into the admin, create the prize and THEN tell the app to inject them into the opportunities – Mathieu Oct 25 '15 at 15:26
  • I'm a rookie in PostgreSQL, but now it's like PostgreSQL can freely choose the best/"obscure" way to pick the row. Maybe my only solution is to change from a simple 'LIMIT 1' to something that would TELL PostgreSQL the way I need it to pick: for example maybe an 'ORDER BY'? But I'd need to choose the CHEAPEST thing for this, as in real life I'd need the app to run the query at the top of my post (check the available opportunities) not on 20 rows but on 1 million rows:) So I need a pick that is ultra fast; that's why I had chosen to put only 'LIMIT 1', as it's the cheapest "arbitrary pick" – Mathieu Oct 25 '15 at 15:37
  • @wildplasser yes, that's the idea :): when I create a prize, it is injected RANDOMLY into the Opportunities rows. But in this question, I don't think this is related to the issue at stake, is it? My problem is that inside the Opportunities table, when the query (at the top of my post) is performed, it does NOT select randomly and seems to show the user all the prizes consecutively (as if it were showing the latest rows of Opportunities that were updated) – Mathieu Oct 25 '15 at 15:51
  • Yes it is. I assume you want to pick just three random tuples from a set of candidates, and toggle some boolean flag. You could do that either by a subquery like yours, or by means of a window function. The locking is a different issue. Your problem is in concurrent access / locking? – wildplasser Oct 25 '15 at 15:54
  • `order by random() limit n` isn't pointless – araqnid Oct 25 '15 at 15:56
  • "I assume you want to pick just three random tuples from a set of candidates, and toggle some boolean flag" not sure to understand this but yes : admin creates 20 opportunties, then admin creates 5 prizes that are injected "randonmly to 5 opportunities (see query at bottom of my post". then when user click 'see an opportunity', it use the query on top of my post : it must CHOOSE ONE of the 20 opportunities -that are still available' (randomly) – Mathieu Oct 25 '15 at 15:57
  • @araqnid "order by random() limit n isn't pointless")> you mean it would be relevant to try ? – Mathieu Oct 25 '15 at 15:58
  • @wildplasser "The locking is a different issue. Your problem is in cuncurrent access / locking?": yes the locking i think is a different issue. i need it because many concurrent calls (more details here: http://stackoverflow.com/questions/33129132/advisory-locks-or-nowait-to-avoid-waiting-for-locked-rows and http://stackoverflow.com/questions/33128531/put-pg-try-advisory-xact-lock-in-a-nested-subquery) – Mathieu Oct 25 '15 at 16:00
  • @araqnid: I see now; the question is rather verbose. The intention was not immediately clear to me. Now, the problem isn't clear, either. Maybe just concurrency / failing to commit? – wildplasser Oct 25 '15 at 16:00
  • No, I tried without FOR UPDATE and without advisory locks: same problem. In my (rookie) opinion, the issue does not come from the concurrency clauses. – Mathieu Oct 25 '15 at 16:00
  • Coming from another post, an expert said "@Mathieu: If a query does not specify ORDER BY, Postgres is free to choose the most efficient way to return qualifying rows in any order. Typically that's a sequential scan. So the order is not random but arbitrary: the same query will probably (but not necessarily) return the same rows in the same order, even without ORDER BY. Unfortunate for a queuing system like yours - but typically still faster than imposing random order. The current physical position of rows in a table can be found in the system column ctid and can change any time." – Mathieu Oct 25 '15 at 16:08
  • Table definitions: `opportunities.id` is the primary key, or is `deal_id` also part of the key? – wildplasser Oct 25 '15 at 16:08
  • So maybe I need to move away from an arbitrary pick, but what would be the next CHEAPEST and fastest way to do it? – Mathieu Oct 25 '15 at 16:08
  • Note: a sequential scan need not be sequential from your point of view. (there are rowversions, and maybe even visibility) – wildplasser Oct 25 '15 at 16:10
  • @wildplasser to answer your question: on the opportunity table I have multiple indexes: 1. add_index "opportunities", ["deal_id", "opportunity_available"], name: "index_on_deal_and_availability", using: :btree and 2. add_index "opportunities", ["prize_id"], name: "index_prize_id", using: :btree. These are Rails indexes, but they translate into 1. an index on (deal_id, opportunity_available) and 2. an index on prize_id – Mathieu Oct 25 '15 at 16:11
  • @wildplasser inside the Opportunity table, you ask about "opportunities.id": well id is of course the primary key – Mathieu Oct 25 '15 at 16:12
  • "Note: a sequential scan need not be sequential from your point of view. (there are rowversions, and maybe even visibility)"=> I'm sorry but don't understand at that high level. i'm just not good enough with postgresql. How should i change my query to make it work? – Mathieu Oct 25 '15 at 16:13
  • @araqnid "order by random() limit n isn't pointless "=> the problem with this is that order by random() is notoriously slow. It has to take all the rows and then order them: if I have 1 million rows and many concurrent reads/writes, I fear it won't meet my requirements. Can't find on stackoverflow now posts about this but I read that it's super slow: example: http://www.gab.lc/articles/bigdata_postgresql_order_by_random – Mathieu Oct 25 '15 at 16:18
  • @araqnid see here http://stackoverflow.com/questions/16777688/optimizing-slow-order-by-rand-query => "ORDER BY RAND() is slow because the DBMS has to read all rows, sort them all, just to keep only a few rows. So the performance of this query heavily depends on the number of rows in the table, and decreases as the number of rows increase." another article about slow ness of order by random on large datasets: https://www.periscopedata.com/blog/how-to-sample-rows-in-sql-273x-faster.html – Mathieu Oct 25 '15 at 16:20
  • if you add the table definition to your question (in psql: `\d opportunities`), we could at least reproduce your problem. – wildplasser Oct 25 '15 at 17:02
  • moved to another question that is more precise: http://stackoverflow.com/questions/33333021/postgresql-9-4-fastest-query-to-select-and-update-on-large-dataset-30m-rows. see my EDIT – Mathieu Oct 25 '15 at 18:05

1 Answer


Just an idea: instead of calling random() in the query, use it as the default value for a column (which can be indexed). A similar way could use a serial with an increment of about 0.7 * INT_MAX.

\i tmp.sql

CREATE TABLE opportunities
    ( id SERIAL NOT NULL PRIMARY KEY
    , deal_id INTEGER NOT NULL DEFAULT 0
    , prize_id INTEGER
    , opportunity_available boolean NOT NULL DEFAULT False
            -- ----------------------------------------
            -- precomputed random() , (could be indexed)
    , magic DOUBLE precision NOT NULL default RANDOM()
    );

INSERT INTO opportunities(deal_id)
SELECT 341
FROM generate_series(1,20) gs
    ;
VACUUM ANALYZE opportunities;

PREPARE add_three (integer) AS (
WITH zzz AS (
  UPDATE opportunities
  SET prize_id = 21
    , opportunity_available = True
    -- updating magic is not *really* needed here ...
    , magic = random()
  WHERE opportunities.id
  IN (
    SELECT opportunities.id
    FROM opportunities
    WHERE (deal_id = $1 AND prize_id IS NULL)
    -- ORDER BY RANDOM()
    ORDER BY magic
    LIMIT 3)
RETURNING id, magic
    ) -- 
SELECT * FROM zzz
    );

PREPARE draw_one (integer) AS (
  WITH upd AS (
  UPDATE opportunities s
  SET    opportunity_available = false
  FROM  (
     SELECT id
     FROM   opportunities
     WHERE  deal_id = $1
     AND    opportunity_available
     AND    pg_try_advisory_xact_lock(id)
     ORDER BY magic
     LIMIT  1

     FOR    UPDATE
     ) sub
  WHERE     s.id = sub.id
  RETURNING s.prize_id, s.id, magic
    )
SELECT * FROM upd
    );

SELECT * FROM opportunities;

\echo add3
EXECUTE add_three(341);
SELECT * FROM opportunities;

\echo add3 more
EXECUTE add_three(341);
SELECT * FROM opportunities;

\echo draw1
EXECUTE draw_one(341);
SELECT * FROM opportunities;

\echo draw2
EXECUTE draw_one(341);
SELECT * FROM opportunities;

VACUUM ANALYZE opportunities;

\echo draw3
EXECUTE draw_one(341);
SELECT * FROM opportunities;

\echo draw4
EXECUTE draw_one(341);
SELECT * FROM opportunities;
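
The idea above says the precomputed column "can be indexed", but the script does not create such an index. A possible shape for it, matching draw_one's filter (the partial-index choice is my own assumption, not part of the tested script):

-- equality on deal_id, only rows still available, ordered by magic:
-- lets the planner find the cheapest qualifying row without a full scan
CREATE INDEX opportunities_deal_magic_idx
    ON opportunities (deal_id, magic)
    WHERE opportunity_available;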
wildplasser
  • thanks for your answer. I'm quite new to PostgreSQL. There is something I don't totally get: what is the advantage of creating a new column (random()) and putting an index on it VERSUS querying the id column (which is primary and could have an index too)? Both are numbers, both have "holes"/gaps between values... why do we really need to create a new column vs using the existing id column? Is it because the query will be much faster to ORDER float values between 0 and 1 (created by random()) than to ORDER integer values between 1 and 1 million? – Mathieu Oct 27 '15 at 20:50
  • Sorry, it appears you are a leecher. It *could be* that I am going to downvote you for the rest of your life. It could be not. Be prepared ... – wildplasser Oct 27 '15 at 22:59