
I'm trying to optimise the following query.

 SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
   FROM Categories C, Items I, Bids B
  WHERE C.id = I.category
    AND I.id = B.item_id
  GROUP BY C.name
  ORDER BY 2 DESC, 3 DESC;
  • Categories is a small table with 20 records.
  • Items is a large table with over 50,000 records.
  • Bids is an even larger table with over 600,000 records.

I have indexes on

Categories(name, id), Items(category), and Bids(item_id, id).

The PRIMARY KEY for each table is: Items(id), Categories(id), Bids(id)
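
For reference, here is a minimal sketch of the schema as described (only the columns the query touches; column types are simplified):

    -- Types simplified for illustration; the real tables have more columns.
    CREATE TABLE Categories (id integer PRIMARY KEY, name text);
    CREATE TABLE Items      (id integer PRIMARY KEY, category integer REFERENCES Categories (id));
    CREATE TABLE Bids       (id integer PRIMARY KEY, item_id integer REFERENCES Items (id));

    CREATE INDEX ON Categories (name, id);
    CREATE INDEX ON Items (category);
    CREATE INDEX ON Bids (item_id, id);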

Is there any way to optimise the query? Any help is appreciated.

Bhanuchander Udhayakumar
William Wu
  • Avoid sorting in SQL and do it in whatever application you export to, e.g. Excel. And re-write the WHERE as ON using explicit inner joins – QHarr Oct 22 '17 at 08:13
  • I'm not sure how much more you can optimize this query because Postgres needs to count every record (in every group). – Tim Biegeleisen Oct 22 '17 at 08:13
  • @QHarr Very good advice, but I don't think that would help. The optimizer probably already converts it into an explicit join. – Tim Biegeleisen Oct 22 '17 at 08:14
  • @TimBiegeleisen I'm guessing the optimizer will convert it. Also, it will probably re-arrange the table ordering to have the smaller one first? – QHarr Oct 22 '17 at 08:16
  • Also, I am familiar with SQL Server, where putting this in a stored procedure would allow for plan caching. – QHarr Oct 22 '17 at 08:18
  • Instead of using a `cross join` use a `join`, but for performance use a `left join` with a check for not-null values – HTH ;). – shA.t Oct 22 '17 at 09:24
  • @philipxy *Misconceived* is unclear to me, what do you mean in detail? - Side note: using `,` between tables means a `cross join` ;). – shA.t Oct 23 '17 at 04:41
  • @shA.t Yeah, not clear. But I was mostly interested in warning the asker. 1. *Because* ',' is just low-precedence cross join & on is just high-precedence where, the obligatory knee-jerk anti-',' "you might forget the where!" is [specious](https://stackoverflow.com/a/25957600/3404097). 2. If one wants a cross/inner join result one should use one, not a left join idiom--which foregoes just saying what you mean, has a slower naive implementation (although it's trivially optimized), and cannot even always be done (namely when a left/right row with all right/left columns null could be returned). – philipxy Oct 23 '17 at 05:46
  • @philipxy Thanks for the details. I think using `join` with `on` is better for readability, and - as I forgot to mention it - using a `left join` and excluding nulls is faster for tables with a low row count, like `Categories`. [(1)](https://stackoverflow.com/a/8511004/4519059) ;). – shA.t Oct 23 '17 at 06:22

3 Answers


Without EXPLAIN (ANALYZE, BUFFERS) output this is guesswork.

The query is so simple that nothing can be optimized there.

  • Make sure that you have correct table statistics; check EXPLAIN (ANALYZE) to see if PostgreSQL's estimates are correct (a sketch follows this list).

  • Increase shared_buffers so that the whole database fits into RAM (if you can).

  • Increase work_mem so that all hashes and sorts are performed in memory.
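
A minimal sketch of those checks (the setting values below are placeholders, not recommendations):

    -- Refresh planner statistics for the tables involved.
    ANALYZE Categories;
    ANALYZE Items;
    ANALYZE Bids;

    -- Compare estimated vs. actual row counts in the plan.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
      FROM Categories C, Items I, Bids B
     WHERE C.id = I.category
       AND I.id = B.item_id
     GROUP BY C.name
     ORDER BY 2 DESC, 3 DESC;

    -- Let hashes and sorts run in memory (this session only).
    SET work_mem = '256MB';                   -- placeholder value

    -- shared_buffers only takes effect after a server restart.
    ALTER SYSTEM SET shared_buffers = '4GB';  -- placeholder value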

Laurenz Albe

Not really; you are scanning all records.

How many of the item records are hit by the data from bids? I would imagine all tables are fully scanned and hash joined, and the indexes disregarded.

Ab Bennett

Your query seems really boilerplate, and I am sure that with the size of your tables, any server that isn't really low on hardware can run this query in a heartbeat. But you can always make things better. Here's a list of optimizations you can make that should, theoretically, boost your query's performance:

  • Theoretically speaking, your biggest inefficiency here is that you are calculating the cross product of your tables instead of joining them. You can rewrite the query with explicit joins (the complete query is shown after this list):

        ...
          FROM Items I
         INNER JOIN Bids B
            ON I.id = B.item_id
         INNER JOIN Categories C
            ON C.id = I.category
        ...
  • If we are considering everything performance-wise, your index on Items(category) is inefficient: it maps only 20 distinct values to over 50,000 rows. With such low selectivity you may even get better performance without this index. However, from a practical point of view there is a lot of other stuff to consider here, so this may not actually be a big deal.
  • You have no explicit index on the id column of the Items table, and an index on that column would speed up your first join. (However, PostgreSQL creates an index on primary key columns by default, so this is not a big deal either.)
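
Putting the pieces together, the full rewritten query would look like this (same result as the original; as the comments note, the planner will very likely choose the same plan for both forms):

    SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
      FROM Items I
     INNER JOIN Bids B
        ON I.id = B.item_id
     INNER JOIN Categories C
        ON C.id = I.category
     GROUP BY C.name
     ORDER BY 2 DESC, 3 DESC;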

Also, prefixing your query with EXPLAIN ANALYZE shows you the plan that the PostgreSQL query planner uses to run your query. If you know a thing or two about query plans, I suggest you take a look at the results of that too, to find any remaining inefficiencies.

Ashkan Kazemi
  • "your biggest inefficiency here is that you are calculating cross product of your tables instead of joining them" This exhibits a profund lack of fundamentals of query optimization, if not query semantics. – philipxy Oct 23 '17 at 06:14