I have two tables - invoices
and invoiceitems
. The relationship is 1-many. My application allows to query invoices using the invoice item fields in the query. Only the invoices are returned, no items.
For instance, I would like to get all the invoices having items, which name contains ac
, case insensitive.
The output is paginated, so I do one query to get the count of the invoices satisfying the condition and then another query to get the respective page of invoices.
The table sizes are:
- invoices - 65,000 records
- invoiceitems - 3,281,518 records
- terms - 5 items
- reps - 5 items
- shipVia - 5 items
Each invoice is linked to at most 100 invoice items.
My problem is that I cannot figure out the optimal indexes for my query:
Schema:
CREATE TABLE invoiceitems
(
id serial NOT NULL,
invoice_id integer NOT NULL,
name text NOT NULL,
...
CONSTRAINT invoiceitems_pkey PRIMARY KEY (id),
CONSTRAINT invoiceitems_invoice_id_fkey FOREIGN KEY (invoice_id)
REFERENCES invoices (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
);
CREATE INDEX idx_lower_name
ON invoiceitems
USING btree
(lower(name) COLLATE pg_catalog."default" text_pattern_ops);
CREATE TABLE invoices
(
id serial NOT NULL,
term_id integer,
rep_id integer NOT NULL,
ship_via_id integer,
...
CONSTRAINT invoices_pkey PRIMARY KEY (id),
CONSTRAINT invoices_rep_id_fkey FOREIGN KEY (rep_id)
REFERENCES reps (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT invoices_ship_via_id_fkey FOREIGN KEY (ship_via_id)
REFERENCES shipvia (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT invoices_term_id_fkey FOREIGN KEY (term_id)
REFERENCES terms (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
);
The count query:
SELECT COUNT(DISTINCT(o.id))
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id WHERE LOWER(items.name) LIKE '%ac%';
Result:
6518
Query plan
"Aggregate (cost=107651.35..107651.36 rows=1 width=4)"
" -> Hash Join (cost=3989.50..106010.59 rows=656304 width=4)"
" Hash Cond: (items.invoice_id = o.id)"
" -> Seq Scan on invoiceitems items (cost=0.00..85089.77 rows=656304 width=4)"
" Filter: (lower(name) ~~ '%ac%'::text)"
" -> Hash (cost=2859.00..2859.00 rows=65000 width=16)"
" -> Seq Scan on invoices o (cost=0.00..2859.00 rows=65000 width=16)"
It seems like my functional index on the invoiceitems.name
field does not play at all. I think it is because I am looking for a part of name, which is not a strict prefix of the name. I am not sure, but it seems that my invoices primary key index does not work here too.
My question is can I optimize the count query and or my schema to be more performant?
I must allow searching by parts of name, which are not strict prefixes and I also must support case insensitive search.
My query to return the matching records is equally bad:
SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id WHERE LOWER(items.name) LIKE '%ac%' LIMIT 100;
And its plan:
"Limit (cost=901846.63..901854.13 rows=100 width=627)"
" -> Unique (cost=901846.63..951069.43 rows=656304 width=627)"
" -> Sort (cost=901846.63..903487.39 rows=656304 width=627)"
" Sort Key: o.id, t.terms, r.rep, s.ship_via, ..."
" -> Hash Join (cost=11509.54..286596.53 rows=656304 width=627)"
" Hash Cond: (items.invoice_id = o.id)"
" -> Seq Scan on invoiceitems items (cost=0.00..85089.77 rows=656304 width=4)"
" Filter: (lower(name) ~~ '%ac%'::text)"
" -> Hash (cost=5491.03..5491.03 rows=65000 width=627)"
" -> Hash Left Join (cost=113.02..5491.03 rows=65000 width=627)"
" Hash Cond: (o.ship_via_id = s.id)"
" -> Hash Left Join (cost=75.35..4559.61 rows=65000 width=599)"
" Hash Cond: (o.rep_id = r.id)"
" -> Hash Left Join (cost=37.67..3628.19 rows=65000 width=571)"
" Hash Cond: (o.term_id = t.id)"
" -> Seq Scan on invoices o (cost=0.00..2859.00 rows=65000 width=543)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36)"
" -> Seq Scan on terms t (cost=0.00..22.30 rows=1230 width=36)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36)"
" -> Seq Scan on reps r (cost=0.00..22.30 rows=1230 width=36)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36)"
" -> Seq Scan on shipvia s (cost=0.00..22.30 rows=1230 width=36)"
I am limited to PostgreSQL. Switching to SQL Server is not an option.
EDIT ==================================================================
I have followed the very informative instructions of Erwin and here is what I have.
The index:
CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name gin_trgm_ops);
The count query with JOIN, but without the extra tables:
EXPLAIN ANALYZE SELECT COUNT(DISTINCT(o.id))
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
WHERE items.name ILIKE '%ac%';
"Aggregate (cost=78961.52..78961.53 rows=1 width=4) (actual time=5205.448..5205.450 rows=1 loops=1)"
" -> Nested Loop (cost=0.00..78960.73 rows=316 width=4) (actual time=0.396..5176.761 rows=6518 loops=1)"
" -> Seq Scan on invoiceitems items (cost=0.00..76885.98 rows=316 width=4) (actual time=0.021..4502.043 rows=6518 loops=1)"
" Filter: (name ~~* '%ac%'::text)"
" Rows Removed by Filter: 3275000"
" -> Index Only Scan using invoices_pkey on invoices o (cost=0.00..6.56 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=6518)"
" Index Cond: (id = items.invoice_id)"
" Heap Fetches: 6518"
"Total runtime: 5205.509 ms"
The count query with semi join:
EXPLAIN ANALYZE SELECT COUNT(1)
FROM invoices o
WHERE EXISTS (
SELECT 1
FROM invoiceitems i
WHERE i.invoice_id = o.id
AND i.name ILIKE '%ac%'
);
"Aggregate (cost=76920.43..76920.44 rows=1 width=0) (actual time=5713.597..5713.598 rows=1 loops=1)"
" -> Nested Loop (cost=76886.76..76919.64 rows=316 width=0) (actual time=5583.706..5703.801 rows=6518 loops=1)"
" -> HashAggregate (cost=76886.76..76886.82 rows=5 width=4) (actual time=5583.568..5594.977 rows=6518 loops=1)"
" -> Seq Scan on invoiceitems i (cost=0.00..76885.98 rows=316 width=4) (actual time=0.295..5148.801 rows=6518 loops=1)"
" Filter: (name ~~* '%ac%'::text)"
" Rows Removed by Filter: 3275000"
" -> Index Only Scan using invoices_pkey on invoices o (cost=0.00..6.56 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=6518)"
" Index Cond: (id = i.invoice_id)"
" Heap Fetches: 6518"
"Total runtime: 5713.804 ms"
The semi-join seems to have no effect. Why?
(I do not think it matters, but I removed the original functional index on lower(invoiceitems.name)
).
EDIT 2=================================================================
I would like to focus on the fetch rows query and provide a bit more of the context.
First of all, the user may demand to order the columns by an arbitrary field from the invoice (but not from the invoice items).
In addition, the user may provide a list of filter statements involving both invoice and invoice item fields. These filter statements capture the semantics of filtering by a string or numeric value, for instance, a filter could be "invoice item name contains 'ac' and invoice discount is higher than 5%"
It is perfectly clear to me, that I am unlikely to have every single field indexed, I will probably have to index only the most common fields, like invoice item name and a few others.
Anyway, here are the indexes I have so far on the invoices and invoiceitems tables:
invoices
- id as the PRIMARY KEY
invoiceitems
- id as the PRIMARY KEY
CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems USING btree (invoice_id);
CREATE INDEX invoiceitems_name_gin_trgm_idx ON invoiceitems USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);
Here is the analysis of the fetch rows query using JOIN against invoice items:
explain analyze
SELECT DISTINCT(o.id), t.terms, r.rep, s.ship_via, ...
FROM invoices o
JOIN invoiceitems items ON items.invoice_id = o.id
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id
WHERE (items.name ILIKE '%df%' AND items.name IS NOT NULL) LIMIT 100;
"Limit (cost=79100.70..79106.95 rows=100 width=312) (actual time=4637.195..4637.195 rows=0 loops=1)"
" -> Unique (cost=79100.70..79120.45 rows=316 width=312) (actual time=4637.190..4637.190 rows=0 loops=1)"
" -> Sort (cost=79100.70..79101.49 rows=316 width=312) (actual time=4637.186..4637.186 rows=0 loops=1)"
" Sort Key: o.id, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, o.ship_to_postal_code, o.purchase_order_no, t.terms, r.rep, ((o.ship_date)::text), s.ship_via, o.delivery, o.hst_percents, o.sub_total, o.total_before_hst, o.total, o.total_discount, o.hst, o.item_count"
" Sort Method: quicksort Memory: 25kB"
" -> Hash Left Join (cost=113.02..79087.58 rows=316 width=312) (actual time=4637.179..4637.179 rows=0 loops=1)"
" Hash Cond: (o.ship_via_id = s.id)"
" -> Hash Left Join (cost=75.35..79043.98 rows=316 width=284) (actual time=4637.123..4637.123 rows=0 loops=1)"
" Hash Cond: (o.rep_id = r.id)"
" -> Hash Left Join (cost=37.67..79001.96 rows=316 width=256) (actual time=4637.119..4637.119 rows=0 loops=1)"
" Hash Cond: (o.term_id = t.id)"
" -> Nested Loop (cost=0.00..78960.73 rows=316 width=228) (actual time=4637.115..4637.115 rows=0 loops=1)"
" -> Seq Scan on invoiceitems items (cost=0.00..76885.98 rows=316 width=4) (actual time=4637.108..4637.108 rows=0 loops=1)"
" Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
" Rows Removed by Filter: 3281518"
" -> Index Scan using invoices_pkey on invoices o (cost=0.00..6.56 rows=1 width=228) (never executed)"
" Index Cond: (id = items.invoice_id)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36) (never executed)"
" -> Seq Scan on terms t (cost=0.00..22.30 rows=1230 width=36) (never executed)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36) (never executed)"
" -> Seq Scan on reps r (cost=0.00..22.30 rows=1230 width=36) (never executed)"
" -> Hash (cost=22.30..22.30 rows=1230 width=36) (never executed)"
" -> Seq Scan on shipvia s (cost=0.00..22.30 rows=1230 width=36) (never executed)"
"Total runtime: 4637.731 ms"
Here is the analysis of the fetch rows query using WHERE EXISTS instead of the JOIN against invoice items:
explain analyze
SELECT o.id, t.terms, r.rep, s.ship_via, ...
FROM invoices o
LEFT JOIN terms t ON t.id = o.term_id
LEFT JOIN reps r ON r.id = o.rep_id
LEFT JOIN shipVia s ON s.id = o.ship_via_id
WHERE EXISTS (
SELECT 1
FROM invoiceitems i
WHERE i.invoice_id = o.id
AND i.name ILIKE '%df%'
AND i.name IS NOT NULL
) LIMIT 100;
"Limit (cost=0.19..43302.88 rows=100 width=610) (actual time=5771.852..5771.852 rows=0 loops=1)"
" -> Nested Loop Left Join (cost=0.19..136836.68 rows=316 width=610) (actual time=5771.848..5771.848 rows=0 loops=1)"
" -> Nested Loop Left Join (cost=0.19..135404.33 rows=316 width=582) (actual time=5771.844..5771.844 rows=0 loops=1)"
" -> Nested Loop Left Join (cost=0.19..134052.55 rows=316 width=554) (actual time=5771.841..5771.841 rows=0 loops=1)"
" -> Merge Semi Join (cost=0.19..132700.78 rows=316 width=526) (actual time=5771.837..5771.837 rows=0 loops=1)"
" Merge Cond: (o.id = i.invoice_id)"
" -> Index Scan using invoices_pkey on invoices o (cost=0.00..3907.27 rows=65000 width=526) (actual time=0.017..0.017 rows=1 loops=1)"
" -> Index Scan using invoiceitems_invoice_id_idx on invoiceitems i (cost=0.00..129298.19 rows=316 width=4) (actual time=5771.812..5771.812 rows=0 loops=1)"
" Filter: ((name IS NOT NULL) AND (name ~~* '%df%'::text))"
" Rows Removed by Filter: 3281518"
" -> Index Scan using terms_pkey on terms t (cost=0.00..4.27 rows=1 width=36) (never executed)"
" Index Cond: (id = o.term_id)"
" -> Index Scan using reps_pkey on reps r (cost=0.00..4.27 rows=1 width=36) (never executed)"
" Index Cond: (id = o.rep_id)"
" -> Index Scan using shipvia_pkey on shipvia s (cost=0.00..4.27 rows=1 width=36) (never executed)"
" Index Cond: (id = o.ship_via_id)"
"Total runtime: 5771.948 ms"
I did not try the third option, which orders the invoiceitems rows by distinct invoice_id, because this approach seems to be viable only when the ordering is not given, whereas usually the contrary is true - the ordering is present.