
I'm working on PostgreSQL with PHP.

Is it possible to select a particular number of random values from a column, as in `FROM table WHERE condition`,

instead of

`select column FROM table WHERE condition`

and then converting the result into an array and using `array_rand()`?

(I do not want to do it this way because I will have millions of rows; selecting all the values first and then running `array_rand()` is probably going to take a lot of time.)


Let's say I have a table like this:

   name    |   items
-----------+------------
 Ben       | {dd,ab,aa}
 David     | {dd,aa}
 Bryan     | {aa,ab,cd}
 Glen      | {cd}
 Edward    | {aa,cd}
 Aaron     | {dd,aa}
 ...         (many many more rows)

Updates:

And I need to select 10 random values in a column (or basically 10 random rows) that match a condition (in this case `items @> ARRAY['aa']`) without a sequential table scan or anything else that is time-consuming.

`order by random()` is going to take a lot of time as it has to deal with every row, so I will go with a better solution instead.
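
Whatever strategy is used for the random pick, the containment filter itself can only avoid a sequential scan with a matching index. A minimal sketch, assuming `items` is a `text[]` column on a table named `tbl` (the name is illustrative):

-- A GIN index supports the @> containment operator,
-- so the WHERE clause can be answered from the index:
CREATE INDEX tbl_items_gin_idx ON tbl USING gin (items);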

  • `order by random()` is a perfectly valid solution, but its performance is poor and degrades progressively with "millions of rows", as the `random()` function has to be evaluated for *every* row and then *all* rows have to be sorted before 10 can be returned. A performance nightmare! If you disclose more information about your case, I may be able to provide a *much faster* solution. I posted one recently [here](http://stackoverflow.com/a/8675160/939860). – Erwin Brandstetter Apr 13 '12 at 20:40
  • http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/ – Brendan Long Apr 14 '12 at 03:17
  • @ErwinBrandstetter Thank you! Yes, I agree with you. `random()` is not a good idea. I've gone through the solution for [select-random-rows-postgresql](http://stackoverflow.com/questions/8674718/best-way-to-select-random-rows-postgresql/8675160#8675160) and it is pretty smart. However, my case is different, as I'm selecting random rows that match a condition instead of just any random rows. I've elaborated my question :) and it would be great if you can help. – Apr 14 '12 at 03:24
  • @BrendanLong Thanks, but I would need to select random rows that match a condition, so I would need a `while` loop if I go with the random-id method, and I am not sure how to do that. Or maybe there is a better solution. – Apr 14 '12 at 03:39
  • @ArchJ: Is your condition stable (always the exact same expression)? Also, please add information about your primary key and whether there are gaps in the numbering. – Erwin Brandstetter Apr 14 '12 at 03:46
  • @ErwinBrandstetter Sure! The condition is stable and there are no gaps in the numbering of my primary key. – Apr 14 '12 at 03:48
  • So it's always `@> ARRAY['aa']` and never `@> ARRAY['bb']`? And how many rows do you expect to match the condition, approx.? – Erwin Brandstetter Apr 14 '12 at 03:50
  • @ErwinBrandstetter Yes. 10 would be the best. –  Apr 14 '12 at 03:52
  • @ErwinBrandstetter Oh, sorry. Up to hundreds of millions. lol... – Apr 14 '12 at 04:00
  • Is the main table read-only? Or does it get updated / inserted? A lot? – Erwin Brandstetter Apr 14 '12 at 04:02
  • No, it is updated and inserted into by users regularly, approx. a few times a day. – Apr 14 '12 at 04:04

2 Answers


In PostgreSQL, you can `order by random()`, so this should be what you want:

select name
from   tbl
where  items @> ARRAY['aa']
order  by random()
limit  10;
  • Thank you, but as Erwin has said, its performance is poor, as `random()` would have to be evaluated for every row and then all rows would have to be sorted. Sorry, but I can't accept your answer. – Apr 14 '12 at 02:56

This solution is good if the base table is not heavily updated; otherwise the maintenance costs may exceed the gain.

If your condition is always `@> ARRAY['aa']`, you could create an auxiliary look-up table (basically a materialized view).

CREATE TABLE tbl_pick (rn serial, id int, PRIMARY KEY (rn, id));

INSERT INTO tbl_pick (id)
SELECT id FROM tbl
WHERE  items @> ARRAY['aa'];

Then you can apply a similar method to the one described [here](http://stackoverflow.com/a/8675160/939860):

SELECT *
FROM  (
    SELECT 1 + floor(random() * <insert_count_plus_a_bit_here>)::integer AS rn
    FROM   generate_series(1, 20) g
    GROUP  BY 1                     -- trim duplicates
    ) r
JOIN   tbl_pick USING (rn)
JOIN   tbl USING (id)
LIMIT  10;                          -- trim surplus

This should be very fast, because it only needs index scans and only ~ 10 rows are read from the table(s).
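
The `<insert_count_plus_a_bit_here>` placeholder has to be supplied by the caller. One way to obtain it (my own suggestion, not part of the original method) is to look up the current maximum rn and add a little slack for rows appended since:

-- Highest rn currently in the look-up table, plus ~1 % slack;
-- use the result as the upper bound for the random picks:
SELECT (max(rn) * 1.01)::int AS count_plus_a_bit
FROM   tbl_pick;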

Of course, you have to update tbl_pick after (relevant) INSERT / UPDATE / DELETE on tbl. Small numbers of changes can simply be added to / deleted from tbl_pick (no in-place updates), because there is some wiggle room in the method. After a certain number of changes you would TRUNCATE and rerun the complete INSERT, basically rewriting your materialized view.
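
A minimal sketch of such a rebuild (RESTART IDENTITY resets the serial, so the rn numbering is dense again):

BEGIN;
-- wipe the look-up table and reset the rn sequence to 1
TRUNCATE tbl_pick RESTART IDENTITY;
-- refill with all currently matching rows
INSERT INTO tbl_pick (id)
SELECT id FROM tbl
WHERE  items @> ARRAY['aa'];
COMMIT;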

UPDATEs and DELETEs could be cascaded to tbl_pick with foreign key constraints (ON UPDATE CASCADE ON DELETE CASCADE), and newly inserted rows could be added with an AFTER INSERT trigger. It all depends on what is possible in the base table.
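
A rough sketch of that setup, assuming tbl has an integer primary key id (constraint, function, and trigger names are illustrative):

-- cascade id changes and deletions from tbl to tbl_pick
ALTER TABLE tbl_pick
  ADD CONSTRAINT tbl_pick_id_fkey
  FOREIGN KEY (id) REFERENCES tbl (id)
  ON UPDATE CASCADE ON DELETE CASCADE;

-- append newly inserted rows that match the condition
CREATE FUNCTION trg_tbl_pick_ins()
  RETURNS trigger
  LANGUAGE plpgsql AS
$$
BEGIN
   IF NEW.items @> ARRAY['aa'] THEN
      INSERT INTO tbl_pick (id) VALUES (NEW.id);
   END IF;
   RETURN NEW;
END
$$;

CREATE TRIGGER tbl_pick_ins
AFTER INSERT ON tbl
FOR EACH ROW EXECUTE PROCEDURE trg_tbl_pick_ins();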

And schedule a complete rewrite of tbl_pick at regular intervals, preferably during off hours.

If your random select queries come in bursts, it might be cheaper to have a "variable" indicating whether tbl_pick is dirty (instead of fk constraints and triggers), and to refill the table only in that case, before you run your query (multiple times). This very much depends on your usage patterns. The "variable" could be a one-row table in which only UPDATE is allowed: set it to TRUE after (relevant) changes to tbl, and to FALSE after you have refreshed the materialized view.
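
For illustration, such a flag could look like this (table and column names are my own):

-- one-row flag table; only UPDATE is allowed on it
CREATE TABLE tbl_pick_dirty (dirty bool NOT NULL);
INSERT INTO tbl_pick_dirty VALUES (false);

-- after (relevant) changes to tbl:
UPDATE tbl_pick_dirty SET dirty = true;

-- before a burst of random selects: rebuild tbl_pick if dirty, then:
UPDATE tbl_pick_dirty SET dirty = false;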

Erwin Brandstetter