4

I am trying to get the count of certain types of records in a related table. I am using a left join.

So I have a query that isn't quite right and one that is returning the correct results. The correct results query has a higher execution cost. Id like to use the first approach, if I can correct the results. (see http://sqlfiddle.com/#!15/7c20b/5/2)

CREATE TABLE people(
  id SERIAL,
  name varchar not null
);

CREATE TABLE pets(
  id SERIAL,
  name varchar not null, 
  kind varchar not null,
  alive boolean not null default false,
  person_id integer not null
);

INSERT INTO people(name) VALUES
('Chad'),
('Buck'); --can't keep pets alive

INSERT INTO pets(name, alive, kind, person_id) VALUES
('doggio', true, 'dog', 1),
('dog master flash', true, 'dog', 1),
('catio', true, 'cat', 1),
('lucky', false, 'cat', 2);

My goal is to get a table back with ALL of the people and the counts of the KINDS of pets they have alive:

| ID | ALIVE_DOGS_COUNT | ALIVE_CATS_COUNT |
|----|------------------|------------------|
|  1 |                2 |                1 |
|  2 |                0 |                0 |

I made the example more trivial. In our production app (not really pets) there would be about 100,000 dead dogs and cats per person. Pretty screwed up I know, but this example is simpler to relay ;) I was hoping to filter all the 'dead' stuff out before the count. I have the slower query in production now (from sqlfiddle above), but would love to get the LEFT JOIN version working.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Cory
  • 63
  • 1
  • 7

3 Answers3

1

Typically fastest if you fetch all or most rows:

SELECT pp.id
     , COALESCE(pt.a_dog_ct, 0) AS alive_dogs_count
     , COALESCE(pt.a_cat_ct, 0) AS alive_cats_count
FROM   people pp
LEFT   JOIN (
   SELECT person_id
        , count(kind = 'dog' OR NULL) AS a_dog_ct
        , count(kind = 'cat' OR NULL) AS a_cat_ct
   FROM   pets
   WHERE  alive
   GROUP  BY 1
   ) pt ON pt.person_id = pp.id;

Indexes are irrelevant here, full table scans will be fastest. Except if alive pets are a rare case, then a partial index should help. Like:

CREATE INDEX pets_alive_idx ON pets (person_id, kind) WHERE alive;

I included all columns needed for the query (person_id, kind) to allow index-only scans.

SQL Fiddle.

Typically fastest for a small subset or a single row:

SELECT pp.id
     , count(kind = 'dog' OR NULL) AS alive_dogs_count
     , count(kind = 'cat' OR NULL) AS alive_cats_count
FROM   people pp
LEFT   JOIN pets pt ON pt.person_id = pp.id
                   AND pt.alive
WHERE  <some condition to retrieve a small subset>
GROUP  BY 1;

You should at least have an index on pets.person_id for this (or the partial index from above) - and possibly more, depending ion the WHERE condition.

Related answers:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    Awesome, and doubly informative. I think I have SELECT'd the right answer. See what I did there. – Cory Oct 18 '14 at 01:03
  • @Cory: You may be interested in the added bits about indexing. – Erwin Brandstetter Oct 18 '14 at 01:17
  • The number of non-alive pets increases over time, the table has related indexes, but its has about 1.5% of the records 'alive'. I actually left this out, but the table is partitioned based on the 'alive' flag and when the state changes records are moved to the partition of dead things. Very morbid up in here. – Cory Oct 18 '14 at 18:00
0

Your WHERE alive=true is actually filtering out record for person_id = 2. Use the below query, push the WHERE alive=true condition into the CASE condition as can be noticed here. See your modified Fiddle

SELECT people.id,
pe.alive_dogs_count,
pe.alive_cats_count
FROM people
LEFT JOIN 
(
select person_id, 
  COALESCE(SUM(case when pets.kind='dog' and alive = true then 1 else 0 end),0) as alive_dogs_count,
  COALESCE(SUM(case when pets.kind='cat' and alive = true then 1 else 0 end),0) as alive_cats_count
from pets
GROUP BY person_id
) pe on people.id = pe.person_id

(OR) your version

SELECT 
  people.id,
  COALESCE(SUM(case when pets.kind='dog' and alive = true then 1 else 0 end),0) as alive_dogs_count,
  COALESCE(SUM(case when pets.kind='cat' and alive = true then 1 else 0 end),0) as alive_cats_count
FROM people
  LEFT JOIN pets on people.id = pets.person_id
GROUP BY people.id;
Rahul
  • 76,197
  • 13
  • 71
  • 125
  • It gets a significantly higher cost in this scenario. Thats the third in the sqlfiddle I posted. – Cory Oct 18 '14 at 00:39
  • In the first query, if a person has no pets, we would still get `NULL` values. `COALESCE` needs to move from the subquery to the outer query. And it can be simplified further. – Erwin Brandstetter Oct 18 '14 at 01:05
0

JOIN with SUM

I think your original query was something like this:

SELECT people.id, stats.dog, stats.cat
  FROM people
  JOIN (SELECT person_id, count(kind)filter(where kind='dog') dog, count(kind)filter(where kind='cat') cat FROM pets WHERE alive GROUP BY person_id) stats
    ON stats.person_id = people.id

That works smoothly, but you should understand, that the result will miss the people with 0 pets, because of inner join. In order to include people who miss pets, you can:

  1. firstly LEFT JOIN,
  2. then GROUP BY joined result
  3. and be ready for NULL values instead of counts.

See the accepted answer above. Credits to @ErwinBrandstetter

Slowness

In contrast to other DBMS', Postgresql doesn't create indexes for foreign keys. One multicolumn index will be more efficient than three single indexes. Extend the foreign key index with extra columns from WHERE and JOIN ON columns in the right order:

CREATE INDEX people_fk_with_kind_alive ON test2 (person_id, alive, kind);

REF: https://postgresql.org/docs/11/indexes-multicolumn.html Of course, your primary keys should be defined. The primary key will be indexed by default.

epox
  • 9,236
  • 1
  • 55
  • 38