I am writing a Node.js application that provides search over a PostgreSQL database. To enable Twitter typeahead in the search box, I have to pull a set of keywords from the database to initialize Bloodhound before the page loads. The query looks something like this:
SELECT distinct handlerid from lotintro where char_length(lotid)=7;
For a large table (lotintro) this is costly. It is also wasteful, because the query result most likely stays the same across web visitors over a period of time.
What is the proper way to handle this? I am considering a few options:
1) Put the query in a stored procedure and call it from node.js:
SELECT * from getallhandlerid()
Does that mean the query is compiled and the database automatically returns the same result set without actually running the query, knowing that the result has not changed?
2) Or create a separate table to store the distinct handlerid values
and refresh that table once a day with a scheduled job? (I know that ideally a trigger should update it on every insert/update to lotintro, but that costs too much.)
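A related variant of option 2 would be to keep the daily snapshot on the Node.js side instead of in a table. A minimal sketch of what I mean, assuming a time-based cache is acceptable; createCachedLoader and its parameters are hypothetical names of mine, and the commented usage assumes a node-postgres pool whose query() returns a promise with a rows array:

```javascript
// Hypothetical helper: wraps an async loader and reuses its result
// until ttlMs milliseconds have passed since the last successful load.
function createCachedLoader(loader, ttlMs, now = Date.now) {
  let value;
  let expiresAt = 0;
  let pending = null; // shared in-flight promise, so concurrent requests trigger one query

  return function get() {
    if (now() < expiresAt) return Promise.resolve(value);
    if (pending) return pending;
    pending = Promise.resolve()
      .then(loader)
      .then((result) => {
        value = result;
        expiresAt = now() + ttlMs;
        return result;
      })
      .finally(() => {
        pending = null; // allow a retry after a failure or after expiry
      });
    return pending;
  };
}

// Usage sketch (assumption: `pool` is a pg.Pool from node-postgres):
// const getHandlerIds = createCachedLoader(
//   () => pool
//     .query("SELECT DISTINCT handlerid FROM lotintro WHERE char_length(lotid) = 7")
//     .then((res) => res.rows.map((r) => r.handlerid)),
//   24 * 60 * 60 * 1000 // refresh at most once a day
// );
```

With this, only the first visitor after expiry pays for the query; everyone else gets the stored array. It does not make the query itself any cheaper.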
3) Create a partial index, as suggested. Here is what I gathered:
Query
SELECT distinct handlerid from lotintro where length(lotid) = 7;
Index
CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE length(lotid) = 7;
With the index, the query takes around 250 ms:
explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=5542.64..5542.65 rows=1 width=6) (actual rows=151 loops=1)"
" -> Bitmap Heap Scan on lotintro (cost=39.08..5537.50 rows=2056 width=6) (actual rows=298350 loops=1)"
" Recheck Cond: (length(lotid) = 7)"
" Rows Removed by Index Recheck: 55285"
" -> Bitmap Index Scan on lotid7_idx (cost=0.00..38.57 rows=2056 width=0) (actual rows=298350 loops=1)"
"Total runtime: 243.686 ms"
Without the index, the query takes around 210 ms:
explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
" -> Seq Scan on lotintro (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
" Filter: (length(lotid) = 7)"
" Rows Removed by Filter: 112915"
"Total runtime: 214.235 ms"
What am I doing wrong here?
4) Using alexius' suggested index and query:
create index on lotintro using btree(char_length(lotid), handlerid);
But it's not an optimal solution. Because there are only a few distinct values, you can use a trick called a loose index scan, which should work much faster in your case:
explain (analyze on, BUFFERS on, TIMING OFF)
WITH RECURSIVE t AS (
(SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1) -- parentheses required
UNION ALL
SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
FROM t
WHERE t.handlerid IS NOT NULL
)
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;
"CTE Scan on t (cost=444.52..446.54 rows=100 width=32) (actual rows=151 loops=1)"
" Filter: (handlerid IS NOT NULL)"
" Rows Removed by Filter: 1"
" Buffers: shared hit=608"
" CTE t"
" -> Recursive Union (cost=0.42..444.52 rows=101 width=32) (actual rows=152 loops=1)"
" Buffers: shared hit=608"
" -> Limit (cost=0.42..4.17 rows=1 width=6) (actual rows=1 loops=1)"
" Buffers: shared hit=4"
" -> Index Scan using lotid_btree on lotintro lotintro_1 (cost=0.42..7704.41 rows=2056 width=6) (actual rows=1 loops=1)"
" Index Cond: (char_length(lotid) = 7)"
" Buffers: shared hit=4"
" -> WorkTable Scan on t t_1 (cost=0.00..43.83 rows=10 width=32) (actual rows=1 loops=152)"
" Filter: (handlerid IS NOT NULL)"
" Rows Removed by Filter: 0"
" Buffers: shared hit=604"
" SubPlan 1"
" -> Limit (cost=0.42..4.36 rows=1 width=6) (actual rows=1 loops=151)"
" Buffers: shared hit=604"
" -> Index Scan using lotid_btree on lotintro (cost=0.42..2698.13 rows=685 width=6) (actual rows=1 loops=151)"
" Index Cond: ((char_length(lotid) = 7) AND (handlerid > t_1.handlerid))"
" Buffers: shared hit=604"
"Planning time: 1.574 ms"
"Execution time: 25.476 ms"
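To check my understanding of why this is so much faster: as far as I can tell, the recursive CTE does one index probe per distinct handlerid (about 150 probes in the plan above) instead of visiting all 298350 matching rows. A minimal JavaScript sketch of that idea, using a sorted in-memory array as a stand-in for the btree index. The function names are hypothetical and this only illustrates the access pattern, not how PostgreSQL implements it:

```javascript
// Index of the first element strictly greater than `key`, found by binary search.
// Assumes `sorted` is in ascending order under the same comparison (<=) used here.
function upperBound(sorted, key) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] <= key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

// Collects distinct values by jumping straight to the next larger value,
// mirroring "handlerid > t.handlerid ORDER BY handlerid LIMIT 1" in the CTE.
function looseDistinct(sorted) {
  const out = [];
  let i = 0;
  while (i < sorted.length) {
    out.push(sorted[i]);
    i = upperBound(sorted, sorted[i]); // skip every duplicate of the current value
  }
  return out;
}

// Example input: sorted handlerid values with many repeats.
const sample = ["H01", "H01", "H01", "H02", "H07", "H07"];
console.log(looseDistinct(sample));
```

The number of probes grows with the number of distinct values rather than the number of rows, which would explain why it pays off when there are only 151 handlerids among roughly 300K rows.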
========= more info on db ============================
dataloggerDB=# \d lotintro
Table "public.lotintro"
Column | Type | Modifiers
--------------+-----------------------------+--------------
lotstartdt | timestamp without time zone | not null
lotid | text | not null
ftc | text | not null
deviceid | text | not null
packageid | text | not null
testprogname | text | not null
testprogdir | text | not null
testgrade | text | not null
testgroup | text | not null
temperature | smallint | not null
testerid | text | not null
handlerid | text | not null
numofsite | text | not null
masknum | text |
soaktime | text |
xamsqty | smallint |
scd | text |
speedgrade | text |
loginid | text |
operatorid | text | not null
loadboardid | text | not null
checksum | text |
lotenddt | timestamp without time zone | not null
totaltest | integer | default (-1)
totalpass | integer | default (-1)
earnhour | real | default 0
avetesttime | real | default 0
Indexes:
"pkey_lotintro" PRIMARY KEY, btree (lotstartdt, testerid)
"lotid7_idx" btree (handlerid) WHERE length(lotid) = 7
Requested details:
Postgres version: PostgreSQL 9.2
Cardinality (how many rows?): about 411K rows in table lotintro
Percentage of rows with length(lotid) = 7: 298350/411000 = 73%
============= after porting over everything to PG 9.4 =====================
With index:
explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=5542.78..5542.79 rows=1 width=6) (actual rows=151 loops=1)"
" Group Key: handlerid"
" Buffers: shared hit=14242"
" -> Bitmap Heap Scan on lotintro (cost=39.22..5537.64 rows=2056 width=6) (actual rows=298350 loops=1)"
" Recheck Cond: (length(lotid) = 7)"
" Heap Blocks: exact=13313"
" Buffers: shared hit=14242"
" -> Bitmap Index Scan on lotid7_idx (cost=0.00..38.70 rows=2056 width=0) (actual rows=298350 loops=1)"
" Buffers: shared hit=929"
"Planning time: 0.256 ms"
"Execution time: 154.657 ms"
Without index:
explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
" Group Key: handlerid"
" Buffers: shared hit=13316"
" -> Seq Scan on lotintro (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
" Filter: (length(lotid) = 7)"
" Rows Removed by Filter: 112915"
" Buffers: shared hit=13316"
"Planning time: 0.168 ms"
"Execution time: 176.466 ms"