
I'm looking for a way to join these two queries (or run these two together):

SELECT  s
FROM    generate_series(1, 50) s;

With this query:

SELECT id FROM foo ORDER BY RANDOM() LIMIT 50;

In a way where I get 50 rows like this:

series, ids_from_foo
1, 53
2, 34
3, 23

I've been at it for a couple of days now and I can't figure it out. Any help would be great.

newUserNameHere

2 Answers


Use row_number():

select row_number() over() as rn, a
from (
    select a
    from foo
    order by random()
    limit 50
) s
order by rn;
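
For illustration, the same pattern adapted to the column names from the question (assuming foo has an id column, as the question's own query suggests):

select row_number() over() as series, id as ids_from_foo
from (
    select id
    from foo
    order by random()
    limit 50
) s
order by series;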
Clodoaldo Neto

Picking the top n rows from a randomly sorted table is a simple but slow way to pick 50 rows at random: all rows have to be sorted for it.

That doesn't matter much for small to medium tables and one-time, ad-hoc use. For repeated use on a big table, there are much more efficient ways. If the ratio of gaps / islands in the primary key is low, use this:

SELECT row_number() OVER() AS rn, *
FROM  (
   SELECT *
   FROM  (
       SELECT trunc(random() * 999999)::int AS foo_id
       FROM   generate_series(1, 55) g
       GROUP  BY 1                     -- fold duplicates
       ) sub1
   JOIN   foo USING (foo_id)
   LIMIT  50
   ) sub2;

With an index on foo_id, this is blazingly fast, no matter how big the table is. (A primary key serves just fine.) Compare performance with EXPLAIN ANALYZE.
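
For example (a sketch; the actual plans and timings depend on your data and indexes), prefix both variants with EXPLAIN ANALYZE and compare the reported total runtimes:

EXPLAIN ANALYZE
SELECT id FROM foo ORDER BY random() LIMIT 50;  -- slow variant: sorts all rows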

How?

999999 is an estimated row count of the table, rounded up. You can get it cheaply from:

SELECT reltuples FROM pg_class WHERE oid = 'foo'::regclass;

Round up to easily include possible new entries since the last ANALYZE. You can also use the expression itself dynamically in a generic query; it's cheap.
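
For illustration, a sketch of how that estimate might be plugged in directly instead of hard-coding 999999 (assuming the table has been analyzed, so reltuples holds a usable estimate; the 1.05 padding is an arbitrary stand-in for "round up"):

SELECT trunc(random() * (SELECT reltuples FROM pg_class WHERE oid = 'foo'::regclass) * 1.05)::int AS foo_id
FROM   generate_series(1, 55) g
GROUP  BY 1;  -- fold duplicates, as above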

55 is your desired number of rows in the result (50), multiplied by a low factor to easily make up for the gap ratio in your table and for (unlikely but possible) duplicate random numbers.

If your primary key does not start near 1 (it does not have to be exactly 1; gaps are covered), add the minimum pk value to the calculation:

min_pkey + trunc(random() * 999999)::int
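
A sketch of what the inner generator could look like with that offset (assuming an index on foo_id, so min(foo_id) is cheap; the scalar subquery is evaluated only once):

SELECT (SELECT min(foo_id) FROM foo)
     + trunc(random() * 999999)::int AS foo_id
FROM   generate_series(1, 55) g
GROUP  BY 1;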


Erwin Brandstetter
  • The MVCC row count issue used to drive me nuts when I first migrated to Postgres from MySQL. Now MySQL drives me nuts. Good explanation (in the other post too). How any RDBMS can live without a generate_series equivalent would be another interesting question. – John Powell Aug 29 '14 at 19:53
  • @JohnBarça `generate_series` can be emulated by selecting row numbers from any large enough table. There are also other DB-specific ways to replace `generate_series`, like `connect by level <= 50` in Oracle. – Ihor Romanchenko Aug 29 '14 at 21:16
  • @IgorRomanchenko True, but selecting rows from existing tables isn't really an equivalent solution, and obviously you can do anything in Oracle if you are prepared to pay for it. My comments were more directed at SQL Server and MySQL. – John Powell Aug 29 '14 at 21:25