116

I have a table in Postgres that contains a couple of million rows. I checked on the internet and found the following:

SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;

It works, but it's really slow... is there another way to write that query, or a direct way to select a random row without reading the whole table? By the way, 'myid' is an integer but it can be an empty field.

Peter O.
Juan
  • If you want to select multiple random rows, see this question: https://stackoverflow.com/q/8674718/247696 – Flimm May 30 '18 at 08:27

8 Answers

120

You might want to experiment with OFFSET, as in

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Here N is the number of rows in mytable. You may need to first do a SELECT COUNT(*) to figure out the value of N.

Update (by Antony Hatchkins)

You must use floor here:

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;

Consider a table of 2 rows: random()*N generates 0 <= x < 2, and, for example, SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows because OFFSET implicitly rounds to the nearest int.
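
A quick way to see that rounding behaviour (a hedged demonstration, using the hypothetical 2-row table above):

SELECT 1.7::int;  -- returns 2: numeric-to-integer casts round to nearest

-- hence, on a 2-row table:
SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;         -- acts as OFFSET 2: returns 0 rows
SELECT myid FROM mytable OFFSET floor(1.7) LIMIT 1;  -- OFFSET 1: returns the second row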

Andrii Abramov
NPE
  • Does it make sense to use an N less than `SELECT COUNT(*)`? I mean, not using all the values in the table but only a part of them? – Juan Mar 14 '11 at 11:00
  • @Juan That depends on your requirements. – NPE Mar 14 '11 at 11:09
  • Using `EXPLAIN SELECT ...` with different values of N gives the same cost for the query, so I guess it's better to go for the maximum value of N. – Juan Mar 14 '11 at 11:32
  • see a bugfix in my answer below – Antony Hatchkins Oct 26 '12 at 08:53
  • This has an off-by-one error. It will never return the first row, and with probability 1/COUNT(*) it will return no rows at all, because it will try to return the row after the last row. – Ian Mar 21 '14 at 20:24
  • What if N = total number of records? You will be scanning the whole table. – tkhuynh Jul 24 '17 at 18:47
77

PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE

The syntax is

SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);

This is not the optimal solution if you want just one row, because you need to know the table's COUNT to calculate the exact percentage.

To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:

 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
 -- if you got no result:
 SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
 ...

This might not look so elegant, but it is probably faster than any of the other answers.
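
A hedged sketch of this retry ladder as a PL/pgSQL function (the function name and the starting percentage are illustrative, not part of the original answer; SYSTEM takes a percentage between 0 and 100):

CREATE OR REPLACE FUNCTION random_row()
RETURNS my_table AS $$
DECLARE
  pct double precision := 0.000001;  -- start with a tiny sample
  result my_table;
BEGIN
  LOOP
    SELECT * INTO result
    FROM my_table TABLESAMPLE SYSTEM (least(pct, 100)) LIMIT 1;
    IF FOUND THEN
      RETURN result;       -- got a row: done
    END IF;
    EXIT WHEN pct >= 100;  -- sampled the whole table and still nothing: it is empty
    pct := pct * 10;       -- no hit: widen the sample and retry
  END LOOP;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

Call it with SELECT * FROM random_row();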

To decide whether you want to use BERNOULLI or SYSTEM, read about the difference at https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/

Harmon758
alfonx
  • This is much much faster and easier than any other answer -- this one should be at the top. – Hayden Schiff Sep 25 '17 at 19:03
  • Why can't you just use a subquery to get the count? `SELECT * FROM my_table TABLESAMPLE SYSTEM(SELECT 1/COUNT(*) FROM my_table) LIMIT 1;`? – machineghost Dec 12 '19 at 21:37
  • @machineghost "To avoid a slow COUNT..." ... If your data is so small that you can count it in reasonable time, go for it! :-) – alfonx Dec 15 '19 at 10:11
  • @machineghost Use `SELECT reltuples FROM pg_class WHERE relname = 'my_table'` for count estimation. – Hynek -Pichi- Vychodil May 04 '20 at 15:21
  • @Hynek-Pichi-Vychodil Very good input! To ensure that the estimation is not outdated, the table has to have been VACUUM ANALYZEd recently... but a good database should be properly analyzed anyway. And it all depends on the specific use case. Usually huge tables don't grow that fast... Thanks! – alfonx May 05 '20 at 08:57
  • @machineghost You need extra brackets, and cast the 1 to a `float`y kind of thing so you get a fractional number returned: `SELECT * FROM my_table TABLESAMPLE SYSTEM((SELECT 1::double precision/COUNT(*) FROM my_table)) LIMIT 1;` – poshest Feb 24 '21 at 16:38
  • `ERROR: TABLESAMPLE clause can only be applied to tables and materialized views`. I was trying to use this in a CTE on a subset of one table's data. Which makes it [pretty limited](https://dba.stackexchange.com/a/258272/92346) in my opinion. – poshest Feb 24 '21 at 16:52
36

I tried this with a subquery and it worked. OFFSET, at least in PostgreSQL v8.4.4, works fine.

select * from mytable offset floor(random() * (select count(*) from mytable)) limit 1;
Jim Garrison
John Coryat
32

You need to use floor:

SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
Antony Hatchkins
  • Consider a table of 2 rows; `random()*N` generates 0 <= x < 2 and for example `SELECT myid FROM mytable OFFSET 1.7 LIMIT 1;` returns 0 rows because of implicit rounding to nearest int. – Antony Hatchkins Oct 26 '12 at 08:48
  • Unfortunately this doesn't work if you want to use a higher LIMIT... I need to get 3 items so I need to use the ORDER BY RANDOM() syntax. – Alexis Wilke Nov 24 '12 at 01:29
  • Three consecutive queries will still be faster than one `order by random()`, approximately as `3*O(N) < O(NlogN)` - real-life figures will be slightly different due to indices. – Antony Hatchkins Nov 24 '12 at 19:27
  • My problem is that the 3 items need to be distinct and a `WHERE myid NOT IN (1st-myid)` and `WHERE myid NOT IN (1st-myid, 2nd-myid)` wouldn't work since the decision is made by the OFFSET. Hmmm... I guess I could reduce N by 1 and 2 in the second and third SELECT. – Alexis Wilke Nov 24 '12 at 22:38
  • Could you or anyone expand this answer with an answer to ***why*** I need to use `floor()`? What advantage does it offer? – ADTC Jul 21 '14 at 11:00
  • `floor(random()*N)` is guaranteed to be 0..N-1 and never N. – Antony Hatchkins May 12 '15 at 19:09
  • This is the correct solution, without floor with N=1, I get numbers between 0 and 1. Sometimes this will give me no rows and sometimes it will give me a single row, I'm assuming because postgres implicitly rounds to the nearest int with OFFSET. – Will Sewell May 16 '15 at 16:19
  • I found out about this the hard way, this should be added to the accepted answer. +1 – Hjalmar Nov 12 '15 at 11:14
  • @Hjalmar that's why I usually read all the answers. It's a common thing here on SO that the accepted answer is not the best one. Added to the accepted answer. – Antony Hatchkins Nov 13 '15 at 04:31
16

Check this link out for some different options. http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

Update: (A.Hatchkins)

The summary of the (very) long article is as follows.

The author lists four approaches:

1) ORDER BY random() LIMIT 1; -- slow

2) ORDER BY id where id >= random()*N LIMIT 1 -- non-uniform if there are gaps

3) random column -- needs to be updated every now and then

4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times

and suggests improving method #2 by using

5) ORDER BY id where id = random()*N LIMIT 1, with subsequent requeries if the result is empty (a minimal sketch follows).
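
A hedged sketch of method #5, assuming id is an indexed integer key and N is a rough upper bound on the ids; the floor()/int cast is an assumption needed for the equality to ever match:

-- pick one candidate id at random; a gap only causes a miss, not bias
SELECT * FROM mytable WHERE id = floor(random() * N)::int;
-- if this returns 0 rows, simply run it again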

Antony Hatchkins
Kuberchaun
  • I wonder why they didn't cover OFFSET? Using an ORDER is out of the question just to get a random row. Fortunately, OFFSET is well covered in the answers. – androidguy Oct 14 '17 at 07:28
8

The easiest and fastest way to fetch a random row is to use the tsm_system_rows extension:

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

Then you can select the exact number of rows you want:

SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(1);

This is available with PostgreSQL 9.5 and later.

See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html
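
For instance, a hedged usage example fetching three rows at once (the same block-level clustering caveat noted in the comments below applies):

SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(3);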

Ben Aubin
daamien
  • Fair warning, this isn't completely random. On smaller tables, I've had it always return the first rows in order. – Ben Aubin Aug 26 '18 at 15:35
  • Yes, this is clearly explained in the documentation (link above): « Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested. » If you have a small dataset, `ORDER BY random() LIMIT 1;` should be fast enough. – daamien Aug 27 '18 at 19:43
  • I saw that. Just wanted to make it clear to anyone who doesn't click the link or if the link dies in the future. – Ben Aubin Aug 28 '18 at 19:52
  • Also worth noting that this will only work for selecting random rows out of a table and THEN filtering, as opposed/compared to running a query and then picking one or some records at random. – nomen Jun 01 '19 at 22:41
3

I've come up with a very fast solution without TABLESAMPLE. It's much faster than OFFSET random()*N LIMIT 1, and it doesn't even require the table count.

The idea is to create an expression index with random but predictable data, for example md5(primary key).

Here is a test with 1M rows of sample data:

create table randtest (id serial primary key, data int not null);

-- 1M rows of sample data
insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);

-- expression index: orders the rows by a random-looking but stable key
create index randtest_md5_id_idx on randtest (md5(id::text));

-- fetch the first row whose md5 key exceeds a freshly generated one
explain analyze
select * from randtest where md5(id::text) > md5(random()::text)
order by md5(id::text) limit 1;

Result:

 Limit  (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
   ->  Index Scan using randtest_md5_id_idx on randtest  (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
         Filter: (md5((id)::text) > md5((random())::text))
         Rows Removed by Filter: 1831
 Total runtime: 6.245 ms

This query can sometimes (with about 1/number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also, the probabilities aren't exactly equal - some rows are more probable than others.

For comparison:

explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;

Results vary widely, but can be pretty bad:

 Limit  (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
   ->  Seq Scan on randtest  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
 Total runtime: 179.211 ms
(3 rows)
Tometzky
  • Fast, yes. Truly random, no. An md5 value that happens to be the next greater value after another existing value has a very slim chance of being picked, while values after a big gap in the number space have a much bigger chance (bigger by the number of possible values in between). The resulting distribution is not random. – Erwin Brandstetter Oct 26 '15 at 00:44
  • Very interesting. Could it work in the use case of a lottery-like query: the query must look at all available tickets and randomly return only ONE single ticket? Also, can I use a pessimistic lock (select...for update) with your technique? – Mathieu Oct 26 '15 at 12:54
  • For anything lottery-related you should really use fair and cryptographically secure random sampling - for example, pick a random number between 1 and max(id) until you find an existing id. The method from this answer is neither fair nor secure - it's fast. Usable for things like 'get a random 1% of rows to test something on', or 'show 5 random entries'. – Tometzky Oct 26 '15 at 13:12
0

I added a randomly generated number to each row. When querying, I generate a random number in my programming language and pass it to the query (in this case 0.27):

SELECT * FROM
(
  (SELECT id, random FROM t where <condition> and random >= 0.27 ORDER BY random LIMIT 1)
  UNION ALL
  (SELECT id, random FROM t where <condition> and random < 0.27 ORDER BY random DESC LIMIT 1)
) as results
ORDER BY abs(0.27-random) LIMIT 1;

(Query taken from here)

With an index on the columns in your condition and on the random column (containing the random numbers), I get a result in 6 ms on my 8.5-million-row table. This is orders of magnitude faster than anything using ORDER BY random().

To improve randomness, you can also regenerate the random number of each row you hit. (Without this, some rows will occur more often than others.)

Unlike TABLESAMPLE, this also supports conditions.
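
A hedged sketch of the setup this answer presupposes (table name t as in the query above; the column and index names are illustrative):

-- add the precomputed random column and backfill existing rows
ALTER TABLE t ADD COLUMN random double precision;
UPDATE t SET random = random();
-- cover future inserts and index the column for the ORDER BY ... LIMIT 1 probes
ALTER TABLE t ALTER COLUMN random SET DEFAULT random();
CREATE INDEX t_random_idx ON t (random);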

Pux