3

I need to quickly select a value ( baz ) from the "earliest" ( MIN(save_date) ) rows grouped by an their foo_id. The following query returns the correct rows (well almost, it can return multiples for each foo_id when there are duplicate save_dates).

The foos table contains about 55k rows and the samples table contains about 25 million rows.

CREATE TABLE foos (
    foo_id     int,
    val        varchar(40),
    # ref_id is a FK, constraint omitted for brevity
    ref_id     int
)
CREATE TABLE samples (
    sample_id  int,
    save_date  date,
    baz        smallint,
    # foo_id is a FK, constraint omitted for brevity
    foo_id     int
)

WITH foo ( foo_id, val ) AS (
        SELECT foo_id, val FROM foos
        WHERE foos.ref_id = 1
    ORDER BY foos.val ASC
    LIMIT 25 OFFSET 0
)
SELECT foo.val, firsts.baz
FROM foo
LEFT JOIN (
    SELECT A.baz, A.foo_id
    FROM samples A
    INNER JOIN (
        SELECT foo_id, MIN( save_date ) AS save_date
        FROM samples
        GROUP BY foo_id
    ) B
    USING ( foo_id, save_date )
) firsts USING ( foo_id )

This query currently takes over 100 seconds; I'd like to see this return in ~1 second (or less!).

How can I write this query to be optimal?


Updated; adding explains:

Obviously the actual query I'm using isn't using tables foo, baz, etc.

The "dumbed down" example query's (from above) explain:

Hash Right Join  (cost=337.69..635.47 rows=3 width=100)
  Hash Cond: (a.foo_id = foo.foo_id)
  CTE foo
    ->  Limit  (cost=71.52..71.53 rows=3 width=102)
          ->  Sort  (cost=71.52..71.53 rows=3 width=102)
                Sort Key: foos.val
                ->  Seq Scan on foos  (cost=0.00..71.50 rows=3 width=102)
                      Filter: (ref_id = 1)
  ->  Hash Join  (cost=265.25..562.90 rows=9 width=6)
        Hash Cond: ((a.foo_id = samples.foo_id) AND (a.save_date = (min(samples.save_date))))
        ->  Seq Scan on samples a  (cost=0.00..195.00 rows=1850 width=10)
        ->  Hash  (cost=244.25..244.25 rows=200 width=8)
              ->  HashAggregate  (cost=204.25..224.25 rows=200 width=8)
                    ->  Seq Scan on samples  (cost=0.00..195.00 rows=1850 width=8)
  ->  Hash  (cost=0.60..0.60 rows=3 width=102)
        ->  CTE Scan on foo  (cost=0.00..0.60 rows=3 width=102)
David Murdoch
  • 87,823
  • 39
  • 148
  • 191
  • What's your explain look like? – lc. Jul 27 '12 at 19:49
  • 1
    Have you tried with correlated subquery? Might be faster because you only need 25 foos. Joining 25M rows to condensed version of itself might be too much. And covering index on samples (foo_id, save_date) seems very much needed. – Nikola Markovinović Jul 27 '12 at 20:02
  • @lc., added `explain` for you. :-) – David Murdoch Jul 27 '12 at 20:07
  • @NikolaMarkovinović I was thinking the inner join would accomplish the same thing. I'll check the indices now. Thanks! – David Murdoch Jul 27 '12 at 20:08
  • `it can return multiples for each foo_id when there are duplicate save_dates` .. So what do you *want* to have returned in that case. multiples one random pick or one specific row (specify!) from the set of dupes? – Erwin Brandstetter Jul 28 '12 at 00:02

2 Answers2

3

If I understand the question, you want windowing.

WITH find_first AS (
  SELECT foo_id, baz,
    row_number()
  OVER (PARTITION BY foo_id ORDER BY foo_id, save_date) AS rnum
  FROM samples
)
SELECT foo_id, baz FROM find_first WHERE rnum = 1;

Using row_number instead of rank eliminates duplicates and guarantees only one baz per foo. If you need to know against foos that have no bazzes, just LEFT JOIN the foos table to this query.

With an index on (foo_id, save_date), the optimizer should be smart enough to do the grouping keeping only one baz and skipping merrily along.

Andrew Lazarus
  • 18,205
  • 3
  • 35
  • 53
2

row_number() is a beautiful beast, but DISTINCT ON is simpler here.

WITH foo AS (
    SELECT foo_id
    FROM   foos
    WHERE  ref_id = 1
    ORDER  BY val
    LIMIT  25 OFFSET 0
    )
SELECT DISTINCT ON (1) f.foo_id, s.baz
FROM   foo f
LEFT   JOIN samples s USING (foo_id)
ORDER  BY f.foo_id, s.save_date, s.baz;

This is assuming you want exactly 1 row per foo_id. If there are multiple rows in sample sharing the same earliest save_date, baz serves as tie-breaker.

The case is very similar to this question from yesterday.

More advice:

  • Don't select val in the CTE, you only need it in ORDER BY.

  • To avoid expensive sequential scans on foos:

    • If you are always after rows from foos with ref_id = 1, create a partial multi-column index:

      CREATE INDEX foos_val_part_idx ON foos (val)
      WHERE ref_id = 1;
      
    • If ref_id is variable:

      CREATE INDEX foos_ref_id_val_idx ON foos (ref_id, val);
      
  • The other index that would help best on samples:

    CREATE INDEX samples_foo_id_save_date_baz_idx
    ON samples (foo_id, save_date, baz);
    

These indexes become even more effective with the new "index-only scans" in version 9.2. Details and links here.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228