11

I have a table in a PostgreSQL 9.2 DB, created and filled as follows:

CREATE TABLE foo( id integer, date date );

INSERT INTO foo
SELECT (id % 10) + 1, now() - (id % 50) * interval '1 day'
FROM generate_series(1, 100000) AS id;

Now, I need to find all pairs (id, date) such that the date is a maximum one among all pairs with the same id. The query is kind of well known and it's common to use the window function called ROW_NUMBER()

SELECT id, date
FROM (
    SELECT id, date, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) rn
    FROM foo
) sbt
WHERE sbt.rn = 1;

Now, I ask for the plan of that query and figured out that the WindowAgg node requires a table to be sorted first.

Subquery Scan on sbt  (cost=11116.32..14366.32 rows=500 width=8) (actual time=71.650..127.809 rows=10 loops=1)
  Filter: (sbt.rn = 1)
  Rows Removed by Filter: 99990
    ->  WindowAgg  (cost=11116.32..13116.32 rows=100000 width=8) (actual time=71.644..122.476 rows=100000 loops=1)
          ->  Sort  (cost=11116.32..11366.32 rows=100000 width=8) (actual time=71.637..92.081 rows=100000 loops=1)
                Sort Key: foo.id, foo.date
                Sort Method: external merge  Disk: 1752kB
                ->  Seq Scan on foo  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.006..6.138 rows=100000 loops=1)

As expected, the sorting took the majority of the query execution time and it would definitely helpful to use indexes.

So I create the CREATE INDEX ON foo(id, date) and expect that now it'll be use indexes. But it doesn't. I got the same plan with the external merge and even turning off the sequential scan wasn't useful. I just ended up with Bitmap Index Scan

->  Sort  (cost=12745.58..12995.58 rows=100000 width=8) (actual time=69.247..90.003 rows=100000 loops=1)
      Sort Key: foo.id, foo.date
      Sort Method: external merge  Disk: 1752kB
      ->  Bitmap Heap Scan on foo  (cost=1629.26..3072.26 rows=100000 width=8) (actual time=5.359..12.639 rows=100000 loops=1)
            ->  Bitmap Index Scan on foo_id_date_idx  (cost=0.00..1604.26 rows=100000 width=0) (actual time=5.299..5.299 rows=100000 loops=1)

QUESTION
Can WindowAgg ever use indexes for sorting? I think that it can't but don't understand why... GroupAggreagtes can and it gives good performance improving.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
St.Antario
  • 26,175
  • 41
  • 130
  • 318
  • Your sorting takes place on disk. Try to increase the `work_mem` for your session and see if that improves things –  Oct 14 '15 at 06:39
  • @a_horse_with_no_name Yes, I know that improving `work_mem` switches the sorting algorithm to `quick sort` if the memory is enought to do that. But why can't we use sorting provided with the index on `foo(id, date)` along with window functions? With aggregate ones we can. – St.Antario Oct 14 '15 at 06:42

2 Answers2

10

To match the index you created:

CREATE INDEX ON foo(id, date)

you would have to make that:

ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC NULLS LAST) 

which is the perfect reverse order of ASC.

That aside, you could just run:

SELECT DISTINCT ON (id)
       id, date
FROM   foo
ORDER  BY id, date DESC NULLS LAST;

But that's probably not what you wanted to ask. Either way, I would make the index:

CREATE INDEX ON foo(id, date DESC NULLS LAST)

so that max(date) is the first index entry per id. Related:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Unfortunately, it doesn't work for me. Are you sure about that or maybe I did something wrong...? – St.Antario Oct 14 '15 at 06:48
  • I mean `(PARTITION BY id ORDER BY date ASC NULLS LAST) ` – St.Antario Oct 14 '15 at 06:49
  • Awesom, you're just wizard!!! I recreated the index as `CREATE INDEX ON foo(id, date DESC NULLS LAST)` and it works completely fine (performance improved twice). But couldn't you add a little explanation or some references to read more about why creating indexes as I did doesn't work with window aggregates. – St.Antario Oct 14 '15 at 06:53
  • @St.Antario: I added some related answers with a lot more details and links. – Erwin Brandstetter Oct 14 '15 at 06:56
1

You can rewrite the logic "the date is a maximum one among all pairs with the same id" to a direct translation:

SELECT id, date
FROM (
    SELECT id, date, MAX(date) OVER (PARTITION BY id) as maxDate
    FROM foo
) sbt
WHERE date = maxDate;

It's not exactly the same as a ROW_NUMBER, but a RANK, it might return multiple rows with the same date

dnoeth
  • 59,503
  • 4
  • 39
  • 56