
I recently asked a question and someone provided a solution that works, but I forgot to mention that my tables have millions of rows (~10 million in the items table, ~1 million in the others); perhaps they assumed I was working with a small dataset like the one in my example.

Here is the SQL:

WITH a AS (
  SELECT item.id, string_agg(prefered_store.store::varchar, ',') wishlist_stores
  FROM item, list_wishlist, wishlist, prefered_store
  WHERE item.list=list_wishlist.list
    AND list_wishlist.wishlist=wishlist.id
    AND wishlist.prefered_stores=prefered_store.id
  GROUP BY item.id
), b AS (
  SELECT item.id, 
    string_agg(
      prefered_store.store::varchar || ',' || prefered_store.comment,
      ' ; ') item_stores_comments
    FROM item, prefered_store
    WHERE item.prefered_stores=prefered_store.id
    GROUP BY item.id
)
SELECT a.id,item_stores_comments,wishlist_stores 
FROM a,b
WHERE a.id=b.id

While it does exactly what I need, it's extremely slow: it takes ~10 minutes with a LIMIT of just one row and about 15 minutes for 10 rows. I'm still waiting to see how long a thousand rows will take (it's been almost an hour). My desktop isn't the fastest, a Pentium 4 with 1.5 GB of RAM, but this still doesn't feel right.

I've indexed all the columns used in the WHERE clauses and created primary keys where needed. Other than that, is there any way to make this query run faster?
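For reference, the indexes are along these lines (an illustrative sketch, not the exact DDL; the real definitions are in the DDL linked below):

-- one index per join column used in the query above; index names are auto-generated
CREATE INDEX ON item (list);
CREATE INDEX ON item (prefered_stores);
CREATE INDEX ON list_wishlist (list);
CREATE INDEX ON list_wishlist (wishlist);
CREATE INDEX ON wishlist (prefered_stores);
CREATE INDEX ON prefered_store (id);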

PostgreSQL 9.2

DDL: https://docs.google.com/file/d/0BwiyuwRCaqkCM09LVkJ4YlVNLWM/edit

Simple diagram with only the relevant tables and fields: [diagram image]

EXPLAIN ANALYZE:

Merge Join  (cost=23342752.95..12971604557.95 rows=863210883998 width=68) (actual time=1182616.544..1251542.167 rows=13139337 loops=1)
  Merge Cond: (a.id = b.id)
  CTE a
    ->  GroupAggregate  (cost=8477658.65..8992463.86 rows=13139337 width=8) (actual time=252170.500..307061.559 rows=13139337 loops=1)
          ->  Sort  (cost=8477658.65..8547771.35 rows=28045080 width=8) (actual time=252170.391..282495.516 rows=14870222 loops=1)
                Sort Key: public.item.id
                Sort Method: external merge  Disk: 261528kB
                ->  Merge Join  (cost=3010452.34..3474579.76 rows=28045080 width=8) (actual time=138444.102..210768.838 rows=14870222 loops=1)
                      Merge Cond: (list_wishlist.list = public.item.list)
                      ->  Sort  (cost=689954.53..695268.01 rows=2125390 width=8) (actual time=30482.447..55193.049 rows=1286901 loops=1)
                            Sort Key: list_wishlist.list
                            Sort Method: external merge  Disk: 22624kB
                            ->  Hash Join  (cost=66643.55..408462.52 rows=2125390 width=8) (actual time=10417.244..26147.517 rows=1286901 loops=1)
                                  Hash Cond: (wishlist.prefered_stores = public.prefered_store.id)
                                  ->  Hash Join  (cost=38565.70..96225.43 rows=1226863 width=8) (actual time=8188.097..19815.024 rows=1226863 loops=1)
                                        Hash Cond: (list_wishlist.wishlist = wishlist.id)
                                        ->  Seq Scan on list_wishlist  (cost=0.00..22266.63 rows=1226863 width=8) (actual time=12.786..7467.442 rows=1226863 loops=1)
                                        ->  Hash  (cost=20352.20..20352.20 rows=1110120 width=8) (actual time=7314.531..7314.531 rows=1110087 loops=1)
                                              Buckets: 4096  Batches: 64  Memory Usage: 689kB
                                              ->  Seq Scan on wishlist  (cost=0.00..20352.20 rows=1110120 width=8) (actual time=7.621..6572.731 rows=1110087 loops=1)
                                  ->  Hash  (cost=14027.49..14027.49 rows=856349 width=8) (actual time=2159.339..2159.339 rows=856349 loops=1)
                                        Buckets: 4096  Batches: 64  Memory Usage: 536kB
                                        ->  Seq Scan on prefered_store  (cost=0.00..14027.49 rows=856349 width=8) (actual time=0.071..1602.971 rows=856349 loops=1)
                      ->  Materialize  (cost=2320484.45..2386181.13 rows=13139337 width=8) (actual time=107961.603..149020.809 rows=14870219 loops=1)
                            ->  Sort  (cost=2320484.45..2353332.79 rows=13139337 width=8) (actual time=107961.575..145971.848 rows=13139337 loops=1)
                                  Sort Key: public.item.list
                                  Sort Method: external merge  Disk: 231088kB
                                  ->  Seq Scan on item  (cost=0.00..228006.37 rows=13139337 width=8) (actual time=27.636..47661.750 rows=13139337 loops=1)
  CTE b
    ->  GroupAggregate  (cost=7166704.38..7843349.46 rows=13139337 width=12) (actual time=524258.000..794537.585 rows=13139337 loops=1)
          ->  Sort  (cost=7166704.38..7223638.09 rows=22773483 width=12) (actual time=524257.908..755765.703 rows=13858612 loops=1)
                Sort Key: public.item.id
                Sort Method: external merge  Disk: 297912kB
                ->  Merge Join  (cost=2448353.26..2826901.79 rows=22773483 width=12) (actual time=201205.036..425873.108 rows=13858612 loops=1)
                      Merge Cond: (public.prefered_store.id = public.item.prefered_stores)
                      ->  Sort  (cost=127685.43..129826.31 rows=856349 width=12) (actual time=4545.447..12507.054 rows=856346 loops=1)
                            Sort Key: public.prefered_store.id
                            Sort Method: external merge  Disk: 18408kB
                            ->  Seq Scan on prefered_store  (cost=0.00..14027.49 rows=856349 width=12) (actual time=0.060..2707.353 rows=856349 loops=1)
                      ->  Materialize  (cost=2320484.45..2386181.13 rows=13139337 width=8) (actual time=196659.554..406944.706 rows=13858611 loops=1)
                            ->  Sort  (cost=2320484.45..2353332.79 rows=13139337 width=8) (actual time=196659.535..396917.629 rows=13139337 loops=1)
                                  Sort Key: public.item.prefered_stores
                                  Sort Method: external merge  Disk: 231096kB
                                  ->  Seq Scan on item  (cost=0.00..228006.37 rows=13139337 width=8) (actual time=0.032..54885.583 rows=13139337 loops=1)
  ->  Sort  (cost=3253469.82..3286318.16 rows=13139337 width=36) (actual time=344329.838..353118.692 rows=13139337 loops=1)
        Sort Key: a.id
        Sort Method: external sort  Disk: 259792kB
        ->  CTE Scan on a  (cost=0.00..262786.74 rows=13139337 width=36) (actual time=252170.512..320132.738 rows=13139337 loops=1)
  ->  Materialize  (cost=3253469.82..3319166.50 rows=13139337 width=36) (actual time=838286.670..888495.578 rows=13139337 loops=1)
        ->  Sort  (cost=3253469.82..3286318.16 rows=13139337 width=36) (actual time=838286.652..886198.912 rows=13139337 loops=1)
              Sort Key: b.id
              Sort Method: external sort  Disk: 385320kB
              ->  CTE Scan on b  (cost=0.00..262786.74 rows=13139337 width=36) (actual time=524258.017..811717.462 rows=13139337 loops=1)
Total runtime: 1253101.865 ms
userBG
  • You cannot profile SQL intuitively; you have to use testing tools. Start at [PostgreSQL Performance Analysis Tools](http://wiki.postgresql.org/wiki/Performance_Analysis_Tools). – Dour High Arch Jan 26 '13 at 01:27
  • Edit your question, and paste the output of `EXPLAIN ANALYZE`. – Mike Sherrill 'Cat Recall' Jan 26 '13 at 01:59
  • 1) You have a total lack of usable keys for the "preferred_stores" table. 2) You repeat the product {preferred_store*item}, which could be squeezed into a **common** table expression, too. (Currently you are effectively doing two "squared" full table scans; the resulting hash tables are too large to fit in memory and spill to disk.) – wildplasser Jan 26 '13 at 12:26
  • Do you mind pasting the SQL DDL for those tables into your question? – Mike Sherrill 'Cat Recall' Jan 26 '13 at 12:58
  • The store and comment tables are missing. – wildplasser Jan 26 '13 at 17:24
  • The prefered_store table does not have a column "id". – wildplasser Jan 26 '13 at 17:32
  • @wildplasser no, because each item and wishlist can have more than one prefered store, so there will be duplicate prefered_store.prefered_store values. That's why I created prefered_store_ids, which contains unique ids that prefered_store.prefered_store references via CONSTRAINT "prefered_store_prefered_store_fkey" FOREIGN KEY. Do you think a primary key id field is needed for that table as well? – userBG Jan 26 '13 at 17:36
  • I actually should have named that foreign key prefered_store_prefered_store_ids_fkey instead of prefered_store_prefered_store_fkey – userBG Jan 26 '13 at 17:42
  • With the supplied DDL the original query does not even run. I'm quitting, since I don't want to reverse engineer this mess. Good luck. – wildplasser Jan 26 '13 at 18:10
  • I was talking about a common table *expression*, not about a common table. – wildplasser Jan 26 '13 at 18:19
  • What is the meaning of the `prefered_store.id` and `prefered_store.store` fields? Which of them is the key? Primary key? – vyegorov Jan 26 '13 at 20:50

3 Answers


In PostgreSQL, CTEs act as an optimization fence: no matter which predicates you add to the outer part of the query, each WITH clause is evaluated in full and materialized. So to optimize, you need to get rid of the CTEs.
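A minimal sketch of the fence, using the item table from the question (the filter value 42 is just a placeholder): in 9.2 the first form computes and materializes the whole aggregate before the outer WHERE is applied, while the second form lets the planner push the predicate into the scan.

-- CTE form: WHERE list = 42 is applied only after the entire
-- aggregate over item has been materialized
WITH agg AS (
  SELECT list, count(*) AS n
  FROM item
  GROUP BY list
)
SELECT * FROM agg WHERE list = 42;

-- Inline subquery form: the planner can push list = 42 down
-- and scan only the matching rows
SELECT *
FROM (SELECT list, count(*) AS n
      FROM item
      GROUP BY list) agg
WHERE list = 42;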

This is not easy, as your prefered_store table is denormalized.

Please, try this query (also on SQL Fiddle):

SELECT i.id item,
    (SELECT string_agg(store||','||comment, ';')
       FROM prefered_store WHERE id=i.prefered_stores) item_stores_comments,
    string_agg(ps.store::text, ',') wishlist_stores
  FROM item i
  JOIN list_wishlist lw ON lw.list=i.list
  JOIN wishlist w ON w.id=lw.wishlist
  JOIN prefered_store ps ON ps.id=w.prefered_stores
 GROUP BY i.id;

But I would recommend reviewing the schema design.
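As a side note grounded in the posted plan: every big sort there reports "Sort Method: external merge Disk: ...", meaning the sorts spill to disk. Whichever query you end up with, raising work_mem for the session may help. The value below is only an example; on a 1.5 GB machine keep it modest, since each sort or hash operation can use that much on its own.

SET work_mem = '64MB';  -- session-level example value, not a recommendation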

vyegorov
  • @ofko, Could you kindly include `EXPLAIN (analyze, buffers) ...` of this query from your schema, please? – vyegorov Jan 26 '13 at 23:20
  • It didn't finish when I tried without a limit; I got `[Err] Out of memory`. I guess I should do it in chunks of a million rows or so, or do you have a suggestion? – userBG Jan 27 '13 at 00:15

Building columns of comma-separated values in SQL is almost always the wrong approach. Better to return rows of data, and let application code deal with display formatting.

Test by eliminating the string_agg() function and the GROUP BY clause.

with a as (
  select item.id, 
         prefered_store.store wishlist_stores
  from item
  inner join list_wishlist on item.list=list_wishlist.list
  inner join wishlist on list_wishlist.wishlist=wishlist.id
  inner join prefered_store on wishlist.prefered_stores=prefered_store.id
), b as (
  select item.id, 
         prefered_store.store,
         prefered_store.comment item_stores_comments
    from item
    inner join prefered_store on item.prefered_stores=prefered_store.id
)
select * from a
inner join b on a.id = b.id

The SQL Fiddle you posted isn't much use. It has no primary keys, no secondary indexes, and not enough rows to avoid sequential scans.

Mike Sherrill 'Cat Recall'

Could it be because you are using "with"?
Why not use "UNION"?
If there are no tables with NULL foreign keys, you can use "LEFT JOIN", which is a quick comparison.

If you post a diagram I can redo the SQL :D

Edgard Leal
  • A common table expression ("with") cannot be simply replaced with a UNION. Those are two completely different things. –  Jan 26 '13 at 08:25
  • I agree with you, but it seems that he is using "WITH" in order to build the "CROSS JOIN" – Edgard Leal Jan 26 '13 at 11:23