0

I want to randomly select 20 rows from a large table and use the following query that works fine:

SELECT id
FROM timeseriesentry
WHERE random() < 20*1.0/12940622

(12940622 is the number of rows in the table). I now want to retrieve the number of rows automatically and use

WITH tmp AS (SELECT COUNT(*) n FROM timeseriesentry)
SELECT id
FROM timeseriesentry, tmp
WHERE random() < 20*1.0/n

which yields zero rows even though n is correct.

What am I missing here?

Edit: id is not numerical which is why I can't create a random series to select from it. I need the proposed structure because my actual goal is

WITH npt AS (
    SELECT type, COUNT(*) n
    FROM timeseriesentry
    GROUP BY type
)
SELECT v.id
FROM timeseriesentry v
JOIN npt ON npt.type= v.type
WHERE random() < 200*1.0/npt.n

which forces roughly the same amount of samples per type.

Vao Tsun
  • 47,234
  • 13
  • 100
  • 132
ratatosk
  • 373
  • 5
  • 19
  • `... WHERE random() < 20*1.0/n` -->> `... WHERE random() < 20*1.0/tmp.n` – wildplasser Mar 11 '15 at 11:25
  • That does not help. As long, as the column names are unique, the table name can be omitted. – ratatosk Mar 11 '15 at 11:35
  • 1
    The formula `random() < 20.0/n` does not guarantee to return any rows at all. This question is a good start on the topic: http://stackoverflow.com/questions/8674718/best-way-to-select-random-rows-postgresql – pozs Mar 11 '15 at 11:55
  • It can be simplified to `20.0 / n`. Does it work if it is `1000.0 / n`? – Clodoaldo Neto Mar 11 '15 at 11:57
  • BTW the `cross join` in `FROM timeseriesentry, tmp` may add to the query cost. Do `WHERE random() < 20.0 / (SELECT COUNT(*) n FROM timeseriesentry)` in instead. – Clodoaldo Neto Mar 11 '15 at 12:01
  • The naive version with a scalar subquery does work: `SELECT * FROM timeseriesentry WHERE random() < 20.0 / ( SELECT COUNT(*) FROM timeseriesentry) ; ` (it is not intended to be fast, BTW) – wildplasser Mar 11 '15 at 12:05
  • @wildplasser Any idea why the naive version works but not the IF clause? I simplified my problem because my actual problem is a bit more complicated and doesn't allow for the naive version. – ratatosk Mar 11 '15 at 14:49
  • @pozs My query is build after reading the question you linked. Unfortunately, the solutions there either need a numerical id or use ORDER BY random() which is way to slow for this table. The query is not guaranteed any resulting rows but with the given number of tried, I should have died from a few million lightning strikes by now in terms of probability. – ratatosk Mar 11 '15 at 14:52
  • 1
    If you insist on the CTE, the following ugly hybrid form does work : `WITH zzz AS ( SELECT COUNT(*) AS cnt FROM timeseriesentry) SELECT * FROM timeseriesentry WHERE random() < 20.0 / (select cnt from zzz);` – wildplasser Mar 11 '15 at 14:57
  • @wildplasser that works and I would accept it as an answer. Can you give me a pointer why this works? – ratatosk Mar 11 '15 at 15:05
  • I don't know why the initial version doesn't work. Could be caused by the count being NULLable or absent. (but that would be the same in the scalar subquery case) – wildplasser Mar 11 '15 at 15:15

3 Answers3

1

This is ugly, but it works. It also avoids the identifier type, which is an (unreserved) keyword.

WITH zzz AS (
        SELECT ztype
        , COUNT(*) AS cnt
         FROM timeseriesentry
        GROUP BY ztype)
SELECT *
FROM timeseriesentry src
WHERE  random() < 20.0 / (SELECT cnt FROM zzz
                        WHERE zzz.ztype = src.ztype)
ORDER BY src.ztype
        ;

UPDATE: the same with a window function in a subquery:

SELECT *
FROM    (SELECT *
        , sum(1) OVER (PARTITION BY ztype) AS cnt
        FROM timeseriesentry
        ) src
WHERE random() < 20.0 / src.cnt
ORDER BY src.ztype
        ;

Or, a bit more compact, the same thing, but using a CTE:

WITH src AS(SELECT *
        , sum(1) OVER (PARTITION BY ztype) AS cnt
        FROM timeseriesentry
        ) 
SELECT *
FROM src
WHERE random() < 20.0 / src.cnt
ORDER BY src.ztype
 ;

Beware: the CTE-versions are not necessarily equal in performance. In fact they often are slower. (since the OQ actually needs to visit all the rows of the timeseriesentry table in either case, there will not be much difference in this particular case)

wildplasser
  • 43,142
  • 8
  • 66
  • 109
1

I created a table with no numeric field:

create table timeseriesentry as select generate_series('2015-01-01'::timestamptz,'2015-01-02'::timestamptz,'1 second'::interval) id, 'ret'::text v
;

and reused window aggregation:

WITH tmp AS (SELECT round(count(*) over()*random()) n FROM timeseriesentry limit 20)
select id from 
(SELECT row_number() over() rn,id
FROM timeseriesentry
) sel, tmp
WHERE rn =n
;

so it gives "random" 20:

2015-01-01 01:27:22+01
2015-01-01 03:33:51+01
2015-01-01 06:15:28+01
2015-01-01 09:52:21+01
2015-01-01 10:00:02+01
2015-01-01 10:08:33+01
2015-01-01 10:26:31+01
2015-01-01 12:55:21+01
2015-01-01 14:03:54+01
2015-01-01 14:05:36+01
2015-01-01 15:12:08+01
2015-01-01 15:45:55+01
2015-01-01 16:10:35+01
2015-01-01 17:11:02+01
2015-01-01 18:18:32+01
2015-01-01 19:35:51+01
2015-01-01 22:06:08+01
2015-01-01 22:12:42+01
2015-01-01 22:43:45+01
2015-01-01 22:49:55+01
Vao Tsun
  • 47,234
  • 13
  • 100
  • 132
0

I guess closest I can offer:

WITH tmp AS (SELECT round(count(*) over()*random()) n FROM timeseriesentry limit 20)
SELECT id
FROM timeseriesentry, tmp
WHERE id=n
Vao Tsun
  • 47,234
  • 13
  • 100
  • 132
  • This would be nice. Unfortunately, I don't have a numerical key in this table, so that's no option. – ratatosk Mar 11 '15 at 14:45