SQL join 1-many querying only parent rows without duplicates

Question

I have two tables - invoices and invoiceitems. The relationship is 1-many. My application allows to query invoices using the invoice item fields in the query. Only the invoices are returned, no items.

For instance, I would like to get all the invoices having items, which name contains ac, case insensitive. The output is paginated, so I do one query to get the count of the invoices satisfying the condition and then another query to get the respective page of invoices.

The table sizes are:

invoices - 65,000 records
invoiceitems - 3,281,518 records
terms - 5 items
reps - 5 items
shipVia - 5 items

Each invoice is linked to at most 100 invoice items.

My problem is that I cannot figure out the optimal indexes for my query:

Schema:

CREATE TABLE invoiceitems
(
  id serial NOT NULL,
  invoice_id integer NOT NULL,
  name text NOT NULL,
  ...
  CONSTRAINT invoiceitems_pkey PRIMARY KEY (id),
  CONSTRAINT invoiceitems_invoice_id_fkey FOREIGN KEY (invoice_id)
      REFERENCES invoices (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
);

CREATE INDEX idx_lower_name
  ON invoiceitems
  USING btree
  (lower(name) COLLATE pg_catalog."default" text_pattern_ops);

CREATE TABLE invoices
(
  id serial NOT NULL,
  term_id integer,
  rep_id integer NOT NULL,
  ship_via_id integer,
  ...
  CONSTRAINT invoices_pkey PRIMARY KEY (id),
  CONSTRAINT invoices_rep_id_fkey FOREIGN KEY (rep_id)
      REFERENCES reps (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT invoices_ship_via_id_fkey FOREIGN KEY (ship_via_id)
      REFERENCES shipvia (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT invoices_term_id_fkey FOREIGN KEY (term_id)
      REFERENCES terms (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
);

The count query:

SELECT COUNT(DISTINCT(o.id))
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id WHERE LOWER(items.name) LIKE '%ac%';

Result:

6518

Query plan

"Aggregate  (cost=107651.35..107651.36 rows=1 width=4)"
"  ->  Hash Join  (cost=3989.50..106010.59 rows=656304 width=4)"
"        Hash Cond: (items.invoice_id = o.id)"
"        ->  Seq Scan on invoiceitems items  (cost=0.00..85089.77 rows=656304 width=4)"
"              Filter: (lower(name) ~~ '%ac%'::text)"
"        ->  Hash  (cost=2859.00..2859.00 rows=65000 width=16)"
"              ->  Seq Scan on invoices o  (cost=0.00..2859.00 rows=65000 width=16)"

It seems like my functional index on the invoiceitems.name field does not play at all. I think it is because I am looking for a part of name, which is not a strict prefix of the name. I am not sure, but it seems that my invoices primary key index does not work here too.

My question is can I optimize the count query and or my schema to be more performant?

I must allow searching by parts of name, which are not strict prefixes and I also must support case insensitive search.

My query to return the matching records is equally bad:

SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id WHERE LOWER(items.name) LIKE '%ac%' LIMIT 100;

And its plan:

"Limit  (cost=901846.63..901854.13 rows=100 width=627)"
"  ->  Unique  (cost=901846.63..951069.43 rows=656304 width=627)"
"        ->  Sort  (cost=901846.63..903487.39 rows=656304 width=627)"
"              Sort Key: o.id, t.terms, r.rep, s.ship_via, ..."
"              ->  Hash Join  (cost=11509.54..286596.53 rows=656304 width=627)"
"                    Hash Cond: (items.invoice_id = o.id)"
"                    ->  Seq Scan on invoiceitems items  (cost=0.00..85089.77 rows=656304 width=4)"
"                          Filter: (lower(name) ~~ '%ac%'::text)"
"                    ->  Hash  (cost=5491.03..5491.03 rows=65000 width=627)"
"                          ->  Hash Left Join  (cost=113.02..5491.03 rows=65000 width=627)"
"                                Hash Cond: (o.ship_via_id = s.id)"
"                                ->  Hash Left Join  (cost=75.35..4559.61 rows=65000 width=599)"
"                                      Hash Cond: (o.rep_id = r.id)"
"                                      ->  Hash Left Join  (cost=37.67..3628.19 rows=65000 width=571)"
"                                            Hash Cond: (o.term_id = t.id)"
"                                            ->  Seq Scan on invoices o  (cost=0.00..2859.00 rows=65000 width=543)"
"                                            ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
"                                                  ->  Seq Scan on terms t  (cost=0.00..22.30 rows=1230 width=36)"
"                                      ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
"                                            ->  Seq Scan on reps r  (cost=0.00..22.30 rows=1230 width=36)"
"                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36)"
"                                      ->  Seq Scan on shipvia s  (cost=0.00..22.30 rows=1230 width=36)"

I am limited to PostgreSQL. Switching to SQL Server is not an option.

EDIT ==================================================================

I have followed the very informative instructions of Erwin and here is what I have.

The index:

CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name gin_trgm_ops);

The count query with JOIN, but without the extra tables:

EXPLAIN ANALYZE SELECT COUNT(DISTINCT(o.id)) 
FROM invoices o 
JOIN invoiceitems items ON items.invoice_id = o.id 
WHERE items.name ILIKE '%ac%';

"Aggregate  (cost=78961.52..78961.53 rows=1 width=4) (actual time=5205.448..5205.450 rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.00..78960.73 rows=316 width=4) (actual time=0.396..5176.761 rows=6518 loops=1)"
"        ->  Seq Scan on invoiceitems items  (cost=0.00..76885.98 rows=316 width=4) (actual time=0.021..4502.043 rows=6518 loops=1)"
"              Filter: (name ~~* '%ac%'::text)"
"              Rows Removed by Filter: 3275000"
"        ->  Index Only Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=6518)"
"              Index Cond: (id = items.invoice_id)"
"              Heap Fetches: 6518"
"Total runtime: 5205.509 ms"

The count query with semi join:

EXPLAIN ANALYZE SELECT COUNT(1)
FROM   invoices o
WHERE EXISTS (
   SELECT 1
   FROM   invoiceitems i 
   WHERE  i.invoice_id = o.id
   AND    i.name ILIKE '%ac%'
   );

"Aggregate  (cost=76920.43..76920.44 rows=1 width=0) (actual time=5713.597..5713.598 rows=1 loops=1)"
"  ->  Nested Loop  (cost=76886.76..76919.64 rows=316 width=0) (actual time=5583.706..5703.801 rows=6518 loops=1)"
"        ->  HashAggregate  (cost=76886.76..76886.82 rows=5 width=4) (actual time=5583.568..5594.977 rows=6518 loops=1)"
"              ->  Seq Scan on invoiceitems i  (cost=0.00..76885.98 rows=316 width=4) (actual time=0.295..5148.801 rows=6518 loops=1)"
"                    Filter: (name ~~* '%ac%'::text)"
"                    Rows Removed by Filter: 3275000"
"        ->  Index Only Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=6518)"
"              Index Cond: (id = i.invoice_id)"
"              Heap Fetches: 6518"
"Total runtime: 5713.804 ms"

The semi-join seems to have no effect. Why?

(I do not think it matters, but I removed the original functional index on lower(invoiceitems.name)).

EDIT 2=================================================================

I would like to focus on the fetch rows query and provide a bit more of the context.

First of all, the user may demand to order the columns by an arbitrary field from the invoice (but not from the invoice items).

In addition, the user may provide a list of filter statements involving both invoice and invoice item fields. These filter statements capture the semantics of filtering by a string or numeric value, for instance, a filter could be "invoice item name contains 'ac' and invoice discount is higher than 5%"

It is perfectly clear to me, that I am unlikely to have every single field indexed, I will probably have to index only the most common fields, like invoice item name and a few others.

Anyway, here are the indexes I have so far on the invoices and invoiceitems tables:

invoices

id as the PRIMARY KEY

invoiceitems

id as the PRIMARY KEY
CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems USING btree (invoice_id);
CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);

Here is the analysis of the fetch rows query using JOIN against invoice items:

explain analyze
SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id 
WHERE (items.name ILIKE '%df%' AND items.name IS NOT NULL) LIMIT 100;

"Limit  (cost=79100.70..79106.95 rows=100 width=312) (actual time=4637.195..4637.195 rows=0 loops=1)"
"  ->  Unique  (cost=79100.70..79120.45 rows=316 width=312) (actual time=4637.190..4637.190 rows=0 loops=1)"
"        ->  Sort  (cost=79100.70..79101.49 rows=316 width=312) (actual time=4637.186..4637.186 rows=0 loops=1)"
"              Sort Key: o.id, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, o.ship_to_postal_code, o.purchase_order_no, t.terms, r.rep, ((o.ship_date)::text), s.ship_via, o.delivery, o.hst_percents, o.sub_total, o.total_before_hst, o.total, o.total_discount, o.hst, o.item_count"
"              Sort Method: quicksort  Memory: 25kB"
"              ->  Hash Left Join  (cost=113.02..79087.58 rows=316 width=312) (actual time=4637.179..4637.179 rows=0 loops=1)"
"                    Hash Cond: (o.ship_via_id = s.id)"
"                    ->  Hash Left Join  (cost=75.35..79043.98 rows=316 width=284) (actual time=4637.123..4637.123 rows=0 loops=1)"
"                          Hash Cond: (o.rep_id = r.id)"
"                          ->  Hash Left Join  (cost=37.67..79001.96 rows=316 width=256) (actual time=4637.119..4637.119 rows=0 loops=1)"
"                                Hash Cond: (o.term_id = t.id)"
"                                ->  Nested Loop  (cost=0.00..78960.73 rows=316 width=228) (actual time=4637.115..4637.115 rows=0 loops=1)"
"                                      ->  Seq Scan on invoiceitems items  (cost=0.00..76885.98 rows=316 width=4) (actual time=4637.108..4637.108 rows=0 loops=1)"
"                                            Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
"                                            Rows Removed by Filter: 3281518"
"                                      ->  Index Scan using invoices_pkey on invoices o  (cost=0.00..6.56 rows=1 width=228) (never executed)"
"                                            Index Cond: (id = items.invoice_id)"
"                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
"                                      ->  Seq Scan on terms t  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"                          ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
"                                ->  Seq Scan on reps r  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"                    ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (never executed)"
"                          ->  Seq Scan on shipvia s  (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"Total runtime: 4637.731 ms"

Here is the analysis of the fetch rows query using WHERE EXISTS instead of the JOIN against invoice items:

explain analyze
SELECT o.id, t.terms, r.rep, s.ship_via, ...
FROM invoices o
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id 
WHERE EXISTS (
   SELECT 1
   FROM   invoiceitems i 
   WHERE  i.invoice_id = o.id
   AND    i.name ILIKE '%df%'
   AND    i.name IS NOT NULL
   ) LIMIT 100;

"Limit  (cost=0.19..43302.88 rows=100 width=610) (actual time=5771.852..5771.852 rows=0 loops=1)"
"  ->  Nested Loop Left Join  (cost=0.19..136836.68 rows=316 width=610) (actual time=5771.848..5771.848 rows=0 loops=1)"
"        ->  Nested Loop Left Join  (cost=0.19..135404.33 rows=316 width=582) (actual time=5771.844..5771.844 rows=0 loops=1)"
"              ->  Nested Loop Left Join  (cost=0.19..134052.55 rows=316 width=554) (actual time=5771.841..5771.841 rows=0 loops=1)"
"                    ->  Merge Semi Join  (cost=0.19..132700.78 rows=316 width=526) (actual time=5771.837..5771.837 rows=0 loops=1)"
"                          Merge Cond: (o.id = i.invoice_id)"
"                          ->  Index Scan using invoices_pkey on invoices o  (cost=0.00..3907.27 rows=65000 width=526) (actual time=0.017..0.017 rows=1 loops=1)"
"                          ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems i  (cost=0.00..129298.19 rows=316 width=4) (actual time=5771.812..5771.812 rows=0 loops=1)"
"                                Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
"                                Rows Removed by Filter: 3281518"
"                    ->  Index Scan using terms_pkey on terms t  (cost=0.00..4.27 rows=1 width=36) (never executed)"
"                          Index Cond: (id = o.term_id)"
"              ->  Index Scan using reps_pkey on reps r  (cost=0.00..4.27 rows=1 width=36) (never executed)"
"                    Index Cond: (id = o.rep_id)"
"        ->  Index Scan using shipvia_pkey on shipvia s  (cost=0.00..4.27 rows=1 width=36) (never executed)"
"              Index Cond: (id = o.ship_via_id)"
"Total runtime: 5771.948 ms"

I did not try the third option, which orders the invoiceitems rows by distinct invoice_id, because this approach seems to be viable only when the ordering is not given, whereas usually the contrary is true - the ordering is present.

If you want a case insensitive search on varchar items, use sql server. Otherwise you have to use an upper() or lower() function in your where clause which is guaranteed to be slow. Any indexes you have will simply not be used, — Dan Bracuk, Oct 01 '13 at 02:39
I am limited to postgresql. Switching to SQL Server is not an option. — mark, Oct 01 '13 at 02:43
I know Oracle lets you define functional indices, which is an index on the result of a function. I'm fairly sure PosgreSQL will allow you to the same. If so, you could create an index on the upper() or lower() of your varchar items, then search on that for a case insensitive search. — Darius Makaitis, Oct 01 '13 at 02:51
invoice_items.name looks like a hidden domain to me. How many distinct values are there for the item name? — wildplasser, Oct 06 '13 at 16:09

Erwin Brandstetter · Accepted Answer · 2023-08-01T19:31:21.110

Indices

Trigram index

Use a trigram index, provided by the module pg_trgm, which provides operator classes for GIN or GiST indexes to support all LIKE and ILIKE patterns, not just left-anchored ones. See:

Pattern matching with LIKE, SIMILAR TO or regular expressions

Additional btree index

Of course, you also need a plain btree index (default) on invoiceitems.invoice_id:

CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems (invoice_id);

Multi-column index for index-only scan

You may get some additional benefit from making this index "covering" for an index-only scan. A GIN index doesn't normally make sense for an integer column like invoice_id. But to save additional heap look-ups, it may pay to include it in a multicolumn GIN (or GiST) index. You'll have to test.

For this, you need the additional module btree_gin (or btree_gist respectively). Example with GIN:

CREATE EXTENSION btree_gin;

CREATE INDEX invoiceitems_name_gin_trgm_idx
ON invoiceitems USING gin (name gin_trgm_ops, invoice_id);

This would obviate the need for above B-tree index. But be sure to create it anyway, to speed up FK-checks, and for many other purposes.

Queries

Counting

For a ...

query to get the count of the invoices

... omit additional tables that can only do harm (if anything):

SELECT COUNT(DISTINCT(item.invoice_id))
FROM   invoiceitems item 
JOIN   invoices o ON item.invoice_id = o.id
LEFT   JOIN terms t ON t.id = o.term_id
LEFT   JOIN reps r ON r.id = o.rep_id
LEFT   JOIN shipVia s ON s.id = o.ship_via_id
WHERE  item.name ILIKE '%ac%';

Since your foreign key constraint guarantees referential integrity, you can even omit the table invoices from this query. Your shiny new index should kick in!

Returning rows

And for returning items:

EXISTS would still be good here:

SELECT t.terms, r.rep, s.ship_via, ...
FROM   invoices     o
LEFT   JOIN terms   t ON t.id = o.term_id
LEFT   JOIN reps    r ON r.id = o.rep_id
LEFT   JOIN shipVia s ON s.id = o.ship_via_id
WHERE  EXISTS (
   SELECT FROM invoiceitems i 
   WHERE  i.invoice_id = o.id
   AND    i.name ILIKE '%ac%'
   )
-- ORDER BY ???
LIMIT 100;

Or you can test this variant which joins to the above query as sub-select. May be even faster:

SELECT t.terms, r.rep, s.ship_via, ...
FROM  (
   SELECT DISTINCT invoice_id
   FROM   invoiceitems
   WHERE  name ILIKE '%ac%'
   ORDER  BY invoice_id           -- order by id = cheapest with above index
   LIMIT  100                     -- LIMIT early!
   ) item
JOIN   invoices     o ON o.id = item.invoice_id
LEFT   JOIN terms   t ON t.id = o.term_id
LEFT   JOIN reps    r ON r.id = o.rep_id
LEFT   JOIN shipVia s ON s.id = o.ship_via_id
-- ORDER BY ???
;

This example gets the first 100 by invoice_id (since you did not provide a sort order). It all depends on the details ...

Thanks for such an informative answer. I have updated my question - please, see the EDIT section. Somehow, the semi index does not seem to cut in. — mark, Oct 01 '13 at 11:28
@mark: The remaining core problem is the missing index on `invoiceitems.invoice_id`. Also added some other improvements. — Erwin Brandstetter, Oct 01 '13 at 12:34
@mark It would be interesting which of the last two queries performs better in your case ... — Erwin Brandstetter, Oct 03 '13 at 18:37
I will post my results, but it is going to take me more time than it took you to answer :-). — mark, Oct 04 '13 at 15:03
I posted an update describing my results. I think I am still missing an index in order to make the WHERE EXIST approach work. — mark, Oct 06 '13 at 14:18