
I have a table with columns id, filter1, filter2, time, value which contains millions of records. I want to fetch n equally distributed rows between two timestamps. If the number of records between the timestamps is less than n, I want to fetch all of them.

My current query looks like the one below, assuming n=200:

SELECT s.* FROM (
    SELECT t.time, t.value,
           ROW_NUMBER() OVER(ORDER BY t.time) as rnk,
           COUNT(*) OVER() as total_cnt
    FROM table_name t
    WHERE t.filter1='filter_value' 
    and t.filter2='another_value' 
    and t.time between '2020-04-18' AND '2020-04-19') s
WHERE MOD(s.rnk, (total_cnt/200)) = 0;

I have an index on (filter1, filter2, time). Still, this query is extremely slow when there are around 10 million records.

I have also tried TABLESAMPLE, but I couldn't come up with a percentage condition that is both fast enough and still returns all rows when there are fewer than n of them.
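For illustration, such an attempt looks roughly like this (the 0.5 percent figure is just a hard-coded placeholder, which is exactly the problem: it neither guarantees ~200 rows nor returns everything when fewer rows qualify):

SELECT time, value
FROM   table_name TABLESAMPLE SYSTEM (0.5)  -- samples ~0.5 % of the table's pages, before the WHERE filters apply
WHERE  filter1 = 'filter_value'
AND    filter2 = 'another_value'
AND    time BETWEEN '2020-04-18' AND '2020-04-19';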

Manu Viswam
  • `n equally distributed rows between two timestamps` - equally distributed over time? any requirements in regard to ties (same timestamp)? or just a random / arbitrary selection of rows? – Erwin Brandstetter Apr 23 '20 at 13:56
  • `ORDER BY random() LIMIT n` is not good enough? Reservoir sampling seems hard to do in pure SQL. – wildplasser Apr 23 '20 at 16:37
  • @ErwinBrandstetter I need a decent spread across the entire data set. Doesn't need exactly equal distribution though. I've tried `TABLESAMPLE` but it does not provide much performance improvement – Manu Viswam Apr 24 '20 at 13:54
  • @wildplasser `ORDER BY random() LIMIT n` has duplicates and is not well distributed all the time. – Manu Viswam Apr 24 '20 at 13:55
  • "Decent spread" with regard to physical location? point in time? some other criteria? what about possible duplicates? Please be specific. Random / arbitrary samples heavily depend on actual requirements. – Erwin Brandstetter Apr 24 '20 at 19:33
  • @ErwinBrandstetter . . . Based on the code, it would be equally distributed based on `time`. – Gordon Linoff Apr 24 '20 at 21:51
  • @GordonLinoff: I'll go with that. But I would have liked more information ... – Erwin Brandstetter Apr 25 '20 at 00:02

2 Answers


If ...

  • ... you have no additional meta information about logical or physical data distribution
  • ... and need the selection to be spread out equally over time

... then your original query is basically as good as it gets. You have the index on (filter1, filter2, time) that Gordon suggested; it helps (a lot) if only a small percentage of rows passes the filters. To get a strictly uniform distribution in the sample, we then have to count and number all qualifying rows, which is the expensive part when many rows qualify.
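To see where the time actually goes, wrap the unchanged query from the question in EXPLAIN (ANALYZE, BUFFERS); the BUFFERS option is optional but instructive:

EXPLAIN (ANALYZE, BUFFERS)
SELECT s.* FROM (
    SELECT t.time, t.value,
           ROW_NUMBER() OVER(ORDER BY t.time) as rnk,
           COUNT(*) OVER() as total_cnt
    FROM table_name t
    WHERE t.filter1='filter_value'
    and t.filter2='another_value'
    and t.time between '2020-04-18' AND '2020-04-19') s
WHERE MOD(s.rnk,(total_cnt/200)) = 0;

Expect the window functions (a WindowAgg node numbering and counting every qualifying row) to account for a large share of the cost when many rows qualify.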

A few minor suggestions:

SELECT s.*
FROM  (
   SELECT t.time, t.value
        , row_number() OVER (ORDER BY t.time) AS rn  -- ①
        , count(*) OVER() AS total_cnt
   FROM   table_name t
   WHERE  t.filter1 = 'filter_value' 
   AND    t.filter2 = 'another_value' 
   AND    t.time >= '2020-04-18'  -- assuming data type timestamp!
   AND    t.time <  '2020-04-20'  -- ②
   ) s
WHERE  mod(s.rn, total_cnt/n) = total_cnt/n/2 + 1;  -- ③

① Use the column alias rn (or whatever) for row_number(); rnk would hint at rank().

② Assuming the column "time" is of data type timestamp, as neither date nor time would make sense. (The name "time" seems misleading.) So this predicate is most likely wrong:

t.time between '2020-04-18' AND '2020-04-19'

The given date literals are coerced to the timestamps 2020-04-18 00:00 / 2020-04-19 00:00. Since BETWEEN includes both the lower and the upper bound, the filter effectively selects all of 2020-04-18 plus the first instant of 2020-04-19. That hardly ever makes sense. My suggested fix includes all of 2020-04-18 and all of 2020-04-19.
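A quick sanity check (purely illustrative, not part of the query) demonstrates the boundary behavior:

SELECT timestamp '2020-04-19 00:00:00'
       BETWEEN '2020-04-18' AND '2020-04-19' AS first_instant_included  -- true
     , timestamp '2020-04-19 00:00:01'
       BETWEEN '2020-04-18' AND '2020-04-19' AS one_second_later;       -- false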

If the column "time" is of data type timestamptz, then the above basically applies as well. Plus, you add a dependency on the timezone setting of the database session into the mix. Don't!
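To verify which type you actually have (a standard information_schema lookup; table and column names taken from the question):

SELECT data_type
FROM   information_schema.columns
WHERE  table_name = 'table_name'   -- the question's table, literally named "table_name" here
AND    column_name = 'time';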

③ Your original condition MOD(s.rnk,(total_cnt/n)) = 0 picks every total_cnt/n-th row, always skipping the first total_cnt/n - 1 rows, which creates a bias for later rows. To illustrate:

ooooXooooXooooXooooX

My alternative shifts the selection towards the center, which seems more reasonable:

ooXooooXooooXooooXoo

Integer division might make total_cnt/n/2 come out as 0. Adding 1 (total_cnt/n/2 + 1) prevents that. Plus, the pick lands closer to the "center" in any case.
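A tiny self-contained illustration (20 rows, n = 4, so total_cnt/n = 5; not part of the original query) shows both selection patterns:

SELECT rn
     , mod(rn, 5) = 0         AS picked_old   -- rows 5, 10, 15, 20 (biased towards the end)
     , mod(rn, 5) = 5/2 + 1   AS picked_new   -- rows 3,  8, 13, 18 (centered in each slot)
FROM   generate_series(1, 20) rn;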

Finally, it should be mentioned that the result for equal values in time is arbitrary. You might want to define a tiebreaker if that matters ...
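A minimal demo of such a tiebreaker, assuming id is unique (id is a column of the question's table):

SELECT id, time
     , row_number() OVER (ORDER BY time)     AS rn_arbitrary      -- ties broken arbitrarily
     , row_number() OVER (ORDER BY time, id) AS rn_deterministic  -- stable across runs
FROM  (VALUES (1, timestamp '2020-04-18 10:00')
            , (2, timestamp '2020-04-18 10:00')
            , (3, timestamp '2020-04-18 11:00')) v(id, time);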

That said, we might be able to use meta information about the data distribution to our advantage, or relax the requirement of a strictly uniform distribution in the sample (to what degree?) to speed things up.


Radically faster with index-only scans

If we can assume uniform data distribution over time for all (or some) combinations of (filter1, filter2), we can just partition the time interval and get away with n very cheap index(-only) scans. (Or, if we don't care too much about uniform data distribution, we might do it anyway.) To illustrate:

WITH input (f1    , f2    , lo                    , hi                    , n) AS (
   VALUES  ('val2', 'val2', timestamp '2020-04-18', timestamp '2020-04-20', 200)
   )
SELECT g.lo, s.*
FROM   (SELECT *, (hi - lo) / n AS span FROM input) i
CROSS  JOIN generate_series(lo, hi - span, span) g(lo)
LEFT   JOIN LATERAL (   
   SELECT t.time, t.value
   FROM   table_name t
   WHERE  t.filter1 = i.f1
   AND    t.filter2 = i.f2
   AND    t.time >= g.lo
   AND    t.time <  g.lo + span
   ORDER  BY time
   LIMIT  1
   ) s ON true;

This is just a proof of concept, which can be tweaked in a hundred and one ways. There is a lot going on in this query and not enough information about the case to streamline it.

The main objective is to avoid processing all rows, and only fetch the ones to return.

The query starts at the lower bound, producing a selection pattern like:

XooooXooooXooooXoooo

The LEFT JOIN keeps empty time slices in the result, which indicate non-uniform data distribution.

Any kind of meta information on table design, data distribution, write patterns etc. might be used to optimize further. Indexing might be optimized: index-only scans, partial indexes, ...
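For example (sketches only, not from the original answer; INCLUDE requires Postgres 11 or later, and the filter values are copied from the question):

-- covering index, allowing index-only scans for the LATERAL subquery:
CREATE INDEX table_name_f1_f2_time_idx
    ON table_name (filter1, filter2, time) INCLUDE (value);

-- or a partial index, if one combination of filters dominates the workload:
CREATE INDEX table_name_hot_time_idx
    ON table_name (time) INCLUDE (value)
    WHERE  filter1 = 'filter_value'
    AND    filter2 = 'another_value';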

Erwin Brandstetter
  • . . If only there were a way to do the sampling from the index before the data pages were fetched, then the query could probably be sped up. This is a case where a clustered index would be quite helpful. – Gordon Linoff Apr 25 '20 at 13:55
  • @GordonLinoff: yes, would be perfect for the use case. there *are* ways to emulate this in Postgres, though. I added a proof of concept. – Erwin Brandstetter Apr 26 '20 at 01:55
  • @ErwinBrandstetter Thank you for the good suggestions. Time is actually `timestamptz` and I've changed my query to use `BETWEEN`. I will test the join-based query to see the performance improvement – Manu Viswam Apr 27 '20 at 12:45

For this query:

SELECT s.*
FROM (SELECT t.time, t.value,
             ROW_NUMBER() OVER (ORDER BY t.time) as rnk,
             COUNT(*) OVER () as total_cnt
      FROM table_name t
      WHERE t.filter1 = 'filter_value' AND
            t.filter2 = 'another_value' AND
            t.time between '2020-04-18' AND '2020-04-19'
     ) s
WHERE MOD(s.rnk, (total_cnt / 200)) = 0 ;

You want an index on (filter1, filter2, time). This should help performance.
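For instance (a minimal sketch; the index name is arbitrary):

CREATE INDEX idx_table_name_filter1_filter2_time
    ON table_name (filter1, filter2, time);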

Gordon Linoff
  • @ManuViswam . . . A single index with all three values or three separate indexes? – Gordon Linoff Apr 23 '20 at 13:54
  • Single index with all three values in the given order – Manu Viswam Apr 24 '20 at 13:51
  • @ManuViswam . . . Any possibility you have inconsistent data types -- including different collations -- that would preclude the use of an index? – Gordon Linoff Apr 24 '20 at 13:57
  • I've done `EXPLAIN ANALYZE` and verified that the query is using the index. Without the index query takes around 80 seconds and with index it takes around 30 seconds. I need to bring it down to below 10 seconds. – Manu Viswam Apr 24 '20 at 14:02