
I am writing a Node.js application to enable search over a PostgreSQL database. To enable Twitter-style typeahead in the search box, I have to crunch a set of keywords from the database to initialize Bloodhound before the page loads. Something like this:

SELECT distinct handlerid from lotintro where char_length(lotid)=7;

For a large table (lotintro), this is costly. It is also wasteful, as the query result most likely stays the same across web visitors over a period of time.

What is the proper way to handle this? I am considering a few options:

1) Put the query in a stored procedure and call it from Node.js:

   SELECT * from getallhandlerid()

Does that mean the query will be compiled, and the database will automatically return the same result set without actually running the query, knowing the result wouldn't have changed?
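A minimal sketch of what I have in mind (the body simply wraps the query above; whether it should be declared STABLE is discussed in the answers below):

    -- Sketch only: wrapping the query in a set-returning SQL function.
    -- This does not cache anything by itself.
    CREATE OR REPLACE FUNCTION getallhandlerid()
      RETURNS SETOF text AS
    $$
      SELECT DISTINCT handlerid
      FROM   lotintro
      WHERE  char_length(lotid) = 7;
    $$ LANGUAGE sql;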

2) Or, create a separate table to store the distinct handlerid values and update it with a job that runs every day? (I know ideally this should happen via a trigger on every insert/update to the table, but that costs too much.)
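As suggested in the comments below, a materialized view could stand in for the separate table; a sketch, assuming PostgreSQL 9.3+ and a view name of my own choosing:

    -- Cache the distinct values in a materialized view and refresh it
    -- from a scheduled job (e.g. daily cron) instead of a trigger.
    CREATE MATERIALIZED VIEW handlerids AS
    SELECT DISTINCT handlerid
    FROM   lotintro
    WHERE  char_length(lotid) = 7;

    -- Run periodically to pick up new data:
    REFRESH MATERIALIZED VIEW handlerids;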

3) Create a partial index as suggested. Here is what I gathered:

Query

SELECT distinct handlerid from lotintro where length(lotid) = 7;

Index

CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE  length(lotid) = 7;

With the index, the query takes around 250 ms. Try running:

explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=5542.64..5542.65 rows=1 width=6) (actual rows=151 loops=1)"
"  ->  Bitmap Heap Scan on lotintro  (cost=39.08..5537.50 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Recheck Cond: (length(lotid) = 7)"
"        Rows Removed by Index Recheck: 55285"
"        ->  Bitmap Index Scan on lotid7_idx  (cost=0.00..38.57 rows=2056 width=0) (actual rows=298350 loops=1)"
"Total runtime: 243.686 ms"

Without the index, the query takes around 210 ms. Try running:

explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
"  ->  Seq Scan on lotintro  (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Filter: (length(lotid) = 7)"
"        Rows Removed by Filter: 112915"
"Total runtime: 214.235 ms"

What am I doing wrong here?

4) Using alexius' suggested index and query:

create index on lotintro using btree(char_length(lotid), handlerid);

But it's not an optimal solution. Because there are only a few distinct values, you may use a trick called a loose index scan, which should work much faster in your case:

explain (analyze on, BUFFERS on, TIMING OFF)
WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1)  -- parentheses required
   UNION ALL
   SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
   FROM t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;

"CTE Scan on t  (cost=444.52..446.54 rows=100 width=32) (actual rows=151 loops=1)"
"  Filter: (handlerid IS NOT NULL)"
"  Rows Removed by Filter: 1"
"  Buffers: shared hit=608"
"  CTE t"
"    ->  Recursive Union  (cost=0.42..444.52 rows=101 width=32) (actual rows=152 loops=1)"
"          Buffers: shared hit=608"
"          ->  Limit  (cost=0.42..4.17 rows=1 width=6) (actual rows=1 loops=1)"
"                Buffers: shared hit=4"
"                ->  Index Scan using lotid_btree on lotintro lotintro_1  (cost=0.42..7704.41 rows=2056 width=6) (actual rows=1 loops=1)"
"                      Index Cond: (char_length(lotid) = 7)"
"                      Buffers: shared hit=4"
"          ->  WorkTable Scan on t t_1  (cost=0.00..43.83 rows=10 width=32) (actual rows=1 loops=152)"
"                Filter: (handlerid IS NOT NULL)"
"                Rows Removed by Filter: 0"
"                Buffers: shared hit=604"
"                SubPlan 1"
"                  ->  Limit  (cost=0.42..4.36 rows=1 width=6) (actual rows=1 loops=151)"
"                        Buffers: shared hit=604"
"                        ->  Index Scan using lotid_btree on lotintro  (cost=0.42..2698.13 rows=685 width=6) (actual rows=1 loops=151)"
"                              Index Cond: ((char_length(lotid) = 7) AND (handlerid > t_1.handlerid))"
"                              Buffers: shared hit=604"
"Planning time: 1.574 ms"
**"Execution time: 25.476 ms"**

========= more info on db ============================

dataloggerDB=# \d lotintro
Table "public.lotintro"

    Column    |            Type             |  Modifiers
 --------------+-----------------------------+--------------
  lotstartdt   | timestamp without time zone | not null
  lotid        | text                        | not null
  ftc          | text                        | not null
  deviceid     | text                        | not null
  packageid    | text                        | not null
  testprogname | text                        | not null
  testprogdir  | text                        | not null
  testgrade    | text                        | not null
  testgroup    | text                        | not null
  temperature  | smallint                    | not null
  testerid     | text                        | not null
  handlerid    | text                        | not null
  numofsite    | text                        | not null
  masknum      | text                        |
  soaktime     | text                        |
  xamsqty      | smallint                    |
  scd          | text                        |
  speedgrade   | text                        |
  loginid      | text                        |
  operatorid   | text                        | not null
  loadboardid  | text                        | not null
  checksum     | text                        |
  lotenddt     | timestamp without time zone | not null
  totaltest    | integer                     | default (-1)
  totalpass    | integer                     | default (-1)
  earnhour     | real                        | default 0
  avetesttime  | real                        | default 0
 Indexes:
     "pkey_lotintro" PRIMARY KEY, btree (lotstartdt, testerid)
     "lotid7_idx" btree (handlerid) WHERE length(lotid) = 7
your version of Postgres,         [PostgreSQL 9.2]
cardinalities (how many rows?),   [411K rows for table lotintro]
percentage for length(lotid) = 7. [298350/411000=  73%]

============= after porting over everything to PG 9.4 =====================

With index:

explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=5542.78..5542.79 rows=1 width=6) (actual rows=151 loops=1)"
"  Group Key: handlerid"
"  Buffers: shared hit=14242"
"  ->  Bitmap Heap Scan on lotintro  (cost=39.22..5537.64 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Recheck Cond: (length(lotid) = 7)"
"        Heap Blocks: exact=13313"
"        Buffers: shared hit=14242"
"        ->  Bitmap Index Scan on lotid7_idx  (cost=0.00..38.70 rows=2056 width=0) (actual rows=298350 loops=1)"
"              Buffers: shared hit=929"
"Planning time: 0.256 ms"
"Execution time: 154.657 ms"

Without index:

explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
"  Group Key: handlerid"
"  Buffers: shared hit=13316"
"  ->  Seq Scan on lotintro  (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Filter: (length(lotid) = 7)"
"        Rows Removed by Filter: 112915"
"        Buffers: shared hit=13316"
"Planning time: 0.168 ms"
"Execution time: 176.466 ms"
sqr
  • "*will automatically return the same result sets without actual running query*" - no. the query will be run every time. 2) could be achieved by using a materialized view - probably the better approach –  May 09 '15 at 09:43
  • I just found a nice article on this. http://www.pgcon.org/2008/schedule/events/69.en.html – sqr May 09 '15 at 12:21
  • "*query cost around 200ms*" - query cost is not in ms (in fact it has no unit at all). The first query using the index has a cost of 5542, the second one (without using the index) has a cost of 19490 - 3.5 times **higher** - so the index usage is more efficient. If you want to get the real runtime of the query (in ms) you need to use `explain (analyze, timing)`. –  May 10 '15 at 10:17
  • @a_horse_with_no_name: To get the most accurate performance comparison `EXPLAIN (ANALYZE, TIMING OFF)` should be best (just without timing for inner details). – Erwin Brandstetter May 10 '15 at 10:26
  • sorry, my bad, the 200 ms and 250 ms were the averages of what I saw at the bottom right corner of the PostgreSQL query tool when running the query. Is that the actual time it takes to execute the query? – sqr May 10 '15 at 11:20
  • updated ticket with 'explain analyze' details. thanks. – sqr May 10 '15 at 11:31
  • More info please: table definition (`\d lotintro` in psql), your version of Postgres, cardinalities (how many rows?), percentage for `length(lotid) = 7`. For the purpose of optimizing, the output of `EXPLAIN (BUFFERS, ANALYZE)` is more useful. What happens if you repeat the test after `ANALYZE`, then after `VACUUM ANALYZE`, then (only if you can afford an exclusive lock!) after `VACUUM FULL ANALYZE`. You'll most probably see an index-only scan after `VACUUM FULL` that's substantially faster (but the effect deteriorates over time if you have lots of writes to the table.) – Erwin Brandstetter May 11 '15 at 00:23
  • hi @Erwin Brandstetter, I just updated the details in the question. Yes, it seems like vacuum might play a significant part here; when I ported the data over from PG 9.2 to PG 9.4, the performance reversed. The indexed query does perform better. – sqr May 11 '15 at 04:05

3 Answers


You need to index the exact expression that's used in your WHERE clause: http://www.postgresql.org/docs/9.4/static/indexes-expressional.html

CREATE INDEX char_length_lotid_idx ON lotintro (char_length(lotid));

You can also create a STABLE or IMMUTABLE function to wrap this query as you suggested: http://www.postgresql.org/docs/9.4/static/sql-createfunction.html

Your last suggestion is also viable; what you are looking for is a MATERIALIZED VIEW: http://www.postgresql.org/docs/9.4/static/sql-creatematerializedview.html This saves you from writing a custom trigger to update the denormalized table.

Clément Prévost
  • You're welcome :). I wonder which solution works best for you and what volume of data you're dealing with. How long does the query take now? And with the index? – Clément Prévost May 09 '15 at 12:39
  • Hi @Clément Prévost, the table lotintro has 411k rows, and there are around 150 distinct handlerid values. A plain SQL query or stored procedure costs about 220 ms. I was initially using PostgreSQL 9.2, which doesn't support MVs. I have now ported the same table to PostgreSQL 9.4 and yes, the MV works as expected, and it is almost instant (<1ms). – sqr May 11 '15 at 03:16

Since 3/4 of the rows satisfy your condition (length(lotid) = 7), an index by itself won't help much. You might get a little better performance with this index because of index-only scans:

create index on lotintro using btree(char_length(lotid), handlerid);

But it's not an optimal solution. Because there are only a few distinct values, you may use a trick called a loose index scan, which should work much faster in your case:

WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1)  -- parentheses required
   UNION ALL
   SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
   FROM t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;

For this query you also need to create the index I mentioned above.

alexius
  • Hi @alexius, this does perform a lot better, although I need some more time to digest what you are doing here. I will add your answer to the question statement as well. Thanks! – sqr May 11 '15 at 04:40

1)

No, a function does not preserve snapshots of the result in any way. There is some potential for performance optimization if you define the function STABLE (which would be correct). Per documentation:

A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement.

IMMUTABLE would be wrong here and potentially cause errors.

So this can hugely benefit multiple calls within the same statement - but that doesn't fit your use case ...

And PL/pgSQL functions work like prepared statements, giving you a similar bonus inside the same session.

2)

Try a MATERIALIZED VIEW. With or without MV (or some other caching technique), a partial index would be most efficient for your special case:

CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE  length(lotid) = 7;

Remember to include the index condition in queries that are supposed to use the index, even if that seems redundant:
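For example, repeating the predicate verbatim so the planner can match it against the partial index (this is the query from the question):

    SELECT DISTINCT handlerid
    FROM   lotintro
    WHERE  length(lotid) = 7;  -- matches the index predicate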

However, as you supplied:

percentage for length(lotid) = 7. [298350/411000= 73%]

That index is only going to help if you can get an index-only scan out of it because the condition is hardly selective. Since the table has very wide rows, index-only scans can be substantially faster.

Loose index scan

Also, rows=298350 are folded to rows=151, so a loose index scan will pay, as I explained here:

Or in the Postgres Wiki - which is actually based on this post.

WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro
    WHERE  length(lotid) = 7
    ORDER  BY 1 LIMIT 1)

   UNION ALL
   SELECT (SELECT handlerid FROM lotintro
           WHERE  length(lotid) = 7
           AND    handlerid > t.handlerid
           ORDER  BY 1 LIMIT 1)
   FROM  t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t
WHERE  handlerid IS NOT NULL;

This is going to be faster yet in combination with the partial index I suggested. Since the partial index is only about half the size and updated less often (depending on access patterns), it's cheaper overall.

Faster still if you keep the table vacuumed to allow index-only scans. You can set more aggressive storage parameters for just this table if you have lots of writes to it:
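For instance (the values below are only an illustration to start from, not a recommendation for your workload):

    -- Per-table autovacuum settings: vacuum and analyze more eagerly than
    -- the global defaults, keeping the visibility map fresh for index-only scans.
    ALTER TABLE lotintro SET (
      autovacuum_vacuum_scale_factor  = 0.05
    , autovacuum_analyze_scale_factor = 0.02
    );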

Finally, you can make this faster still with a materialized view based on this query:
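A sketch combining the two (the view name is my own):

    CREATE MATERIALIZED VIEW distinct_handlerid AS
    WITH RECURSIVE t AS (
       (SELECT handlerid FROM lotintro
        WHERE  length(lotid) = 7
        ORDER  BY 1 LIMIT 1)

       UNION ALL
       SELECT (SELECT handlerid FROM lotintro
               WHERE  length(lotid) = 7
               AND    handlerid > t.handlerid
               ORDER  BY 1 LIMIT 1)
       FROM  t
       WHERE t.handlerid IS NOT NULL
       )
    SELECT handlerid FROM t
    WHERE  handlerid IS NOT NULL;

    REFRESH MATERIALIZED VIEW distinct_handlerid;  -- on whatever schedule fits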

Erwin Brandstetter
  • Hi, @Erwin Brandstetter, I am trying the partial index here. It appears that things are getting worse after adding the index. I ran 'EXPLAIN' and it confirms that the query is actually using the index. What could be going wrong here? -- I updated the details in the question statement. Thanks! – sqr May 10 '15 at 10:10
  • @sqr: Can you repeat your test with `EXPLAIN (ANALYZE, TIMING OFF)`? (Best of 5.) – Erwin Brandstetter May 10 '15 at 10:23
  • Multiple parties are editing a few places of this ticket, and I ended up seeing alexius' answer before your edit on 'loose index scan'. As I see it now, your answer suggests all possible solutions with complete explanations, so I will mark your reply as the answer. Thanks. -- I thought I could mark multiple answers earlier. – sqr May 11 '15 at 08:34