Is there a faster way than 'ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW'

Question

I've got the following database table:

date        a   b   c   d   add result
23.02.07    A   B   C1  D1  1   1
24.02.07    A   B   C1  D1  0   1
25.02.07    A   B   C1  D1  1   2
26.02.07    A   B   C1  D1  1   3
27.02.07    A   B   C1  D1  1   4
28.02.07    A   B   C1  D1  0   4
01.03.07    A   B   C1  D1  0   4
02.03.07    A   B   C1  D1  0   4
03.03.07    A   B   C1  D1  1   5
04.03.07    A   B   C1  D1  0   5
05.03.07    A   B   C1  D1  0   5
06.03.07    A   B   C1  D1  0   5
07.03.07    A   B   C1  D1  2   7

17.02.07    A   B   C2  D2  1   1
18.02.07    A   B   C2  D2  0   1
19.02.07    A   B   C2  D2  0   1
20.02.07    A   B   C2  D2  0   1
21.02.07    A   B   C2  D2  0   1
22.02.07    A   B   C2  D2  0   1
23.02.07    A   B   C2  D2  0   1
24.02.07    A   B   C2  D2  0   1
25.02.07    A   B   C2  D2  1   2
26.02.07    A   B   C2  D2  3   5
27.02.07    A   B   C2  D2  1   6
28.02.07    A   B   C2  D2  0   6

The column result (last one) is not part of the actual dataset. This column is showcasing what I am trying to achieve. Basically I am summin' up all previous values of "add" and the current one of a given partition by using a analytical function like below:

SUM(add) OVER(PARTITION BY
    A,
    B,
    C,
    D,
    ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS result

This approach works but is slow when querying large collections of data.

Some more insights:

There is a btree index on date (is not a date datatype but a character varying).
All other are character varying instead of add which is an integer.
A and B have a btree index as well.

Is there better / more performant way to do it?

This should already be the fastest approach. You can speed this up with this covering index, which already has the data in the desired order: `create index idx_covering on mytable(a,b,c,d,date,add);`. — Thorsten Kettner, Feb 06 '17 at 16:03

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

Window frame definition

Your ORDER BY is not deterministic (multiple rows have the same date per partition). With the frame definition ROWSBETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, the result of your query can change between calls.

Read the definitions in the manual carefully.

The default frame definition is RANGEBETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which probably makes more sense for your case: It adds all peers of the current row in the sort order (same day in the partition), so you get the same total for all rows in the partition with the same day.

Either make the sort order deterministic (for instance by adding the PK as last ORDER BY item) or switch to RANGE. And since RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default, you can then simplify:

SUM(add) OVER(PARTITION BY A, B, C, D ORDER BY date)

Finding if current row is last row to be selected from database

Proper `date` handling

This is not right:

There is a btree index on date (is not a date datatype but a character varying).

That's a mistake with consequences. Never store dates as varchar, store dates as date. Even if ORDER BY date should happen to work correctly with your particular string format ('23.02.07' like you have in the question does not), it is still slower. And error prone. And storage is bigger (ergo also slower, again).

Performance optimization

All that aside, a multicolumn index with index columns in matching order could improve performance (like Thorsten commented):

CREATE INDEX foo ON tbl (A, B, C, D, date, add);

You might cluster the physical table using this index for even faster results:

Optimize Postgres timestamp query range

Details depend on the complete situation: Postgres version, table definition, complete query, ...

Are you sure about the default frame? I assumed PostgreSQL supports Standard SQL's default and your link shows it, too: *The default framing option is RANGE UNBOUNDED PRECEDING*. `RANGE` is usually way more expensive than `ROWS` (even if the result is the same due to a unique ORDER BY), so `ROWS UNBOUNDED PRECEDING` should be better. — dnoeth, Feb 06 '17 at 17:22
@dnoeth: Thanks! You are absolutely right, my oversight. `RANGE` is the default. I adapted my answer accordingly. "Better" is defined by requirements, though. — Erwin Brandstetter, Feb 06 '17 at 18:00

Is there a faster way than 'ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW'

1 Answers1

Window frame definition

Proper date handling

Performance optimization

Proper `date` handling