Postgresql 9.4 - FASTEST query to select and update on large dataset (>30M rows) with heavy writes/reads and locks

Question

I want to select one row among a large dataset (>30Million rows) with heavy writes/reads RANDOMLY.

My problem I can't let the arbitrary pick to postgresql (that would have been the cheapest / fastest query, ust using 'limit 1') as it behaves erratically and in "obscure ways": see my initial problem here: postgresql 9.4 - prevent app selecting always the latest updated rows

Here is my current query

UPDATE opportunities s
SET    opportunity_available = false
FROM  (
   SELECT id
   FROM   opportunities
   WHERE  deal_id = #{@deal.id}
   AND    opportunity_available
   AND    pg_try_advisory_xact_lock(id)
   LIMIT  1
   FOR    UPDATE
   ) sub
WHERE     s.id = sub.id
RETURNING s.prize_id, s.id;
// inspired by https://stackoverflow.com/questions/33128531/put-pg-try-advisory-xact-lock-in-a-nested-subquery

I asked a first question (postgresql 9.4 - prevent app selecting always the latest updated rows) but i think that even if there is no clear answer, what is happening is that postgresql is left free to make a arbitrary pick (as I only use 'LIMIT 1' because I wanted the cheapest/fastest query), which is VERY DIFFERENT from a RANDOM pick. But as a consequence, postgresql often outputs the latest rows updated by the administrator (which are always opportunities that have all prizes), versus really choosing randomly the row.

I think I need to move away from the arbitrary pick to get a RANDOM pick.

In that context what is the best choice i.e the fastest to select (notice the 'FOR UPDATE' and 'advisory locks' as i need to lock rows when they are being updated with for update to prevent concurrent calls...I'll use soon in postgresql 9.5 skip locked as soon as 9.5 goes out of beta)

Use order with random() but it is notoriously (read many many posts on on stackoverflow and stack exchange dba about this) to be REALLY really slow on large datasets => "ORDER BY RAND() is slow because the DBMS has to read all rows, sort them all, just to keep only a few rows. So the performance of this query heavily depends on the number of rows in the table, and decreases as the number of rows increase.", as explained here or here
Use offset is also know to be slow for large datasets
Use Sampling like explained/advised here by what seem big experts: https://www.periscopedata.com/blog/how-to-sample-rows-in-sql-273x-faster.html
Use another advanced technique you might suggest

yes well if I check the column id (the primary key) it's like 1,3,4,5,6 45, 47,... so if that's what you mean, yes some ids can be absent (this is because after creating it, the administrator can, even if not often, delete them so their ids would be missing afterwards)./ — Mathieu, Oct 25 '15 at 18:21
IMHO there is no real differnce to the first question. Even the DDL is again absent. — wildplasser, Oct 25 '15 at 18:24
but comments were getting lost as my question was not targeted enough: now it's about discussing ways to move from ARBITRARY to RANDOM pick. — Mathieu, Oct 25 '15 at 18:32
I was working on the first one, now I've lost appetite. Observation: you are too much obsessed with speed and performance. IMnsHO you should focus on correctness first. [and there still is no DDL.] — wildplasser, Oct 25 '15 at 18:37
@Mathieu Have you looked at [the solutions offered here](http://stackoverflow.com/questions/5297396/quick-random-row-selection-in-postgres)? — Schwern, Oct 25 '15 at 18:38
@Schwern nope but seems very interesting, especially the latest answer about postgresql 9.5 TABLEPLAN even if i need to select each time only one row, thanks — Mathieu, Oct 25 '15 at 18:47
See my new very fast solution on original answer: http://stackoverflow.com/a/33334193/15862 — Tometzky, Oct 25 '15 at 19:38
Your question would be more useful with the actual table definition showing data types and constraints (what you get in psql with `\d opportunities` or a complete `CREATE TABLE` script) and information about data distribution in `id` (*in the question*). In particular: what's the ratio of gaps / islands in the number space? — Erwin Brandstetter, Oct 28 '15 at 11:27

Postgresql 9.4 - FASTEST query to select and update on large dataset (>30M rows) with heavy writes/reads and locks

0 Answers0

Linked