1

I want to select one row among a large dataset (>30Million rows) with heavy writes/reads RANDOMLY.

My problem I can't let the arbitrary pick to postgresql (that would have been the cheapest / fastest query, ust using 'limit 1') as it behaves erratically and in "obscure ways": see my initial problem here: postgresql 9.4 - prevent app selecting always the latest updated rows

Here is my current query

UPDATE opportunities s
SET    opportunity_available = false
FROM  (
   SELECT id
   FROM   opportunities
   WHERE  deal_id = #{@deal.id}
   AND    opportunity_available
   AND    pg_try_advisory_xact_lock(id)
   LIMIT  1
   FOR    UPDATE
   ) sub
WHERE     s.id = sub.id
RETURNING s.prize_id, s.id;
// inspired by https://stackoverflow.com/questions/33128531/put-pg-try-advisory-xact-lock-in-a-nested-subquery

I asked a first question (postgresql 9.4 - prevent app selecting always the latest updated rows) but i think that even if there is no clear answer, what is happening is that postgresql is left free to make a arbitrary pick (as I only use 'LIMIT 1' because I wanted the cheapest/fastest query), which is VERY DIFFERENT from a RANDOM pick. But as a consequence, postgresql often outputs the latest rows updated by the administrator (which are always opportunities that have all prizes), versus really choosing randomly the row.

I think I need to move away from the arbitrary pick to get a RANDOM pick.

In that context what is the best choice i.e the fastest to select (notice the 'FOR UPDATE' and 'advisory locks' as i need to lock rows when they are being updated with for update to prevent concurrent calls...I'll use soon in postgresql 9.5 skip locked as soon as 9.5 goes out of beta)

  • Use order with random() but it is notoriously (read many many posts on on stackoverflow and stack exchange dba about this) to be REALLY really slow on large datasets => "ORDER BY RAND() is slow because the DBMS has to read all rows, sort them all, just to keep only a few rows. So the performance of this query heavily depends on the number of rows in the table, and decreases as the number of rows increase.", as explained here or here

  • Use offset is also know to be slow for large datasets

  • Use Sampling like explained/advised here by what seem big experts: https://www.periscopedata.com/blog/how-to-sample-rows-in-sql-273x-faster.html

  • Use another advanced technique you might suggest

Community
  • 1
  • 1
Mathieu
  • 4,587
  • 11
  • 57
  • 112
  • Are there gaps in the ids? – Jakub Kania Oct 25 '15 at 18:10
  • yes well if I check the column id (the primary key) it's like 1,3,4,5,6 45, 47,... so if that's what you mean, yes some ids can be absent (this is because after creating it, the administrator can, even if not often, delete them so their ids would be missing afterwards)./ – Mathieu Oct 25 '15 at 18:21
  • 3
    IMHO there is no real differnce to the first question. Even the DDL is again absent. – wildplasser Oct 25 '15 at 18:24
  • but comments were getting lost as my question was not targeted enough: now it's about discussing ways to move from ARBITRARY to RANDOM pick. – Mathieu Oct 25 '15 at 18:32
  • 1
    I was working on the first one, now I've lost appetite. Observation: you are too much obsessed with speed and performance. IMnsHO you should focus on correctness first. [and there still is no DDL.] – wildplasser Oct 25 '15 at 18:37
  • 2
    @Mathieu Have you looked at [the solutions offered here](http://stackoverflow.com/questions/5297396/quick-random-row-selection-in-postgres)? – Schwern Oct 25 '15 at 18:38
  • @Schwern nope but seems very interesting, especially the latest answer about postgresql 9.5 TABLEPLAN even if i need to select each time only one row, thanks – Mathieu Oct 25 '15 at 18:47
  • See my new very fast solution on original answer: http://stackoverflow.com/a/33334193/15862 – Tometzky Oct 25 '15 at 19:38
  • 1
    Your question would be more useful with the actual table definition showing data types and constraints (what you get in psql with `\d opportunities` or a complete `CREATE TABLE` script) and information about data distribution in `id` (*in the question*). In particular: what's the ratio of gaps / islands in the number space? – Erwin Brandstetter Oct 28 '15 at 11:27

0 Answers0