
I have a simple table in Postgres with a bit over 8 million rows. The column of interest holds short text strings, typically one or more words, total length less than 100 characters. It is declared as `character varying(100)`. The column is indexed. A simple lookup like the one below takes > 3000 ms.

SELECT a, b, c FROM t WHERE a LIKE '?%'

Yes, for now, the need is simply to find the rows where "a" starts with the entered text. I want to bring the lookup speed down to under 100 ms (the appearance of instantaneous). Suggestions? It seems to me that full text search won't help here, as my text column is too short, but I would be happy to try it if worthwhile.

Oh, btw, I also loaded the exact same data into mongodb and indexed column "a". Loading the data into mongodb was amazingly quick (mongodb++). Both mongodb and Postgres are pretty much instantaneous on exact lookups. But Postgres actually shines on trailing-wildcard searches like the one above, consistently taking about 1/3 as long as mongodb. I would be happy to pursue mongodb if I could speed that up, as this is only a read-only operation.

Update: First, a couple of EXPLAIN ANALYZE outputs

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a LIKE 'abcd%'

"Seq Scan on t  (cost=0.00..282075.55 rows=802 width=40) 
    (actual time=1220.132..1220.132 rows=0 loops=1)"
"  Filter: ((a)::text ~~ 'abcd%'::text)"
"Total runtime: 1220.153 ms"

I actually want to compare lower(a) with the search term, which is always at least 4 characters long, so

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE Lower(a) LIKE 'abcd%'

"Seq Scan on t  (cost=0.00..302680.04 rows=40612 width=40) 
    (actual time=4.681..3321.387 rows=788 loops=1)"
"  Filter: (lower((a)::text) ~~ 'abcd%'::text)"
"Total runtime: 3321.504 ms"

So I created an index

CREATE INDEX idx_t ON t USING btree (Lower(Substring(a, 1, 4) ));

"Seq Scan on t  (cost=0.00..302680.04 rows=40612 width=40) 
    (actual time=3243.841..3243.841 rows=0 loops=1)"
"  Filter: (lower((a)::text) = 'abcd%'::text)"
"Total runtime: 3243.860 ms"

It seems the only time an index is used is when I am looking for an exact match

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a = 'abcd'

"Index Scan using idx_t on geonames  (cost=0.00..57.89 rows=13 width=40) 
    (actual time=40.831..40.923 rows=17 loops=1)"
"  Index Cond: ((ascii_name)::text = 'Abcd'::text)"
"Total runtime: 40.940 ms"

Found a solution by implementing an index with varchar_pattern_ops, and am now looking for even quicker lookups.
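For reference, the pattern-ops index that worked looks roughly like this (the index name here is my own; the operator class is the part that matters):

```sql
-- varchar_pattern_ops makes a plain b-tree index usable for
-- left-anchored LIKE even when the database locale is not 'C'.
CREATE INDEX idx_t_a_pattern ON t (a varchar_pattern_ops);

-- The planner can now turn the trailing-wildcard LIKE into an index range scan:
EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a LIKE 'abcd%';
```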

punkish
  • Index should work efficiently for `LIKE 'a%'`, but if the number of matching rows is large, they will all need to be read and transported back to the client, which takes time. How many rows are there in the query result? – Branko Dimitrijevic Feb 09 '12 at 15:42
  • The number of matching rows typically is between 10 and 50. In all my measurements, I am not reporting the time in transporting data. I am merely reporting the time to actually select data... that is what I am interested in reducing. – punkish Feb 09 '12 at 15:53
  • The >3000 ms is far too long for only 10-50 resulting rows. Is your execution plan OK? Is the index actually being used by the query? – Branko Dimitrijevic Feb 09 '12 at 16:02
  • Also, does the **second** execution of the identical query run equally slow? _(I'm wondering if you are experiencing a bad case of "cache pollution".)_ – Branko Dimitrijevic Feb 09 '12 at 16:03
  • Please post EXPLAIN ANALYZE output. – filiprem Feb 09 '12 at 18:34

1 Answer


The PostgreSQL query planner is smart, but not an AI. To make it use an index on an expression, use the exact same expression in the query.

With an index like this:

CREATE INDEX t_a_lower_idx ON t (lower(substring(a, 1, 4)));

Or simpler in PostgreSQL 9.1:

CREATE INDEX t_a_lower_idx ON t (lower(left(a, 4)));

Use this query:

SELECT * FROM t WHERE lower(left(a, 4)) = 'abcd';

Which is 100% functionally equivalent to:

SELECT * FROM t WHERE lower(a) LIKE 'abcd%'

Or:

SELECT * FROM t WHERE a ILIKE 'abcd%'

But not:

SELECT * FROM t WHERE a LIKE 'abcd%'

This is a functionally different query and you need a different index:

CREATE INDEX t_a_idx ON t (substring(a, 1, 4));

Or simpler with PostgreSQL 9.1:

CREATE INDEX t_a_idx ON t (left(a, 4));

And use this query:

SELECT * FROM t WHERE left(a, 4) = 'abcd';

Left-anchored search terms of variable length

Case insensitive. If you run your db with any locale other than the default 'C', you need to specify the operator class explicitly - text_pattern_ops in my example:

CREATE INDEX t_a_lower_idx
ON t (lower(left(a, <insert_max_length>)) text_pattern_ops);

Query:

SELECT * FROM t WHERE lower(left(a, <insert_max_length>)) ~~ 'abcdef%';

This can utilize the index and is almost as fast as the variant with a fixed length.

You may be interested in this post on dba.SE with more details about pattern matching, especially the last part about the operators ~>=~ and ~<~.
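To illustrate what the operator class buys you: conceptually (this is a sketch, not the planner's literal output), the pattern operators let the planner rewrite the left-anchored LIKE into a range condition on the index. Using 20 as an example maximum length:

```sql
-- With text_pattern_ops in place, a query like
--   WHERE lower(left(a, 20)) ~~ 'abcdef%'
-- can be executed as an index range scan, roughly equivalent to:
SELECT *
FROM   t
WHERE  lower(left(a, 20)) ~>=~ 'abcdef'   -- at or after the prefix ...
AND    lower(left(a, 20)) ~<~  'abcdeg';  -- ... and before the next string
```

The upper bound is formed by "incrementing" the last character of the prefix, which is why a b-tree index in plain byte order (what text_pattern_ops provides) can serve the query regardless of locale.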

Erwin Brandstetter
  • `CREATE INDEX t_a_lower_idx ON t (lower(substring(a, 1, 4)));` combined with `SELECT * FROM t WHERE lower(left(a, 4)) = 'abcd';` works great. Sadly, it fails for `SELECT * FROM t WHERE lower(a) LIKE 'abcde%'`. I am building an autocomplete box which is triggered on min 4 chars, but the user can type more than 4 to narrow the search. I want the performance of http://ninjawords.com or http://definr.com. Would a GIN index or WildSpeed index help? – punkish Feb 10 '12 at 03:53
  • @punkish: For left-anchored search terms, a plain b-tree index should deliver top performance. I added another bit to my answer for variable-length search terms. – Erwin Brandstetter Feb 10 '12 at 06:21
  • Sorry, the above advice doesn't seem to help at all. `CREATE INDEX idx_t_lower_a ON t (Lower(Substring(a, 1, 20)));` followed by `EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE Lower(Substring(a, 1, 20)) ~~ 'abcde%'` gives me `Seq Scan on t (cost=0.00..322985.94 rows=40612 width=40) (actual time=82.393..4839.435 rows=403 loops=1) Filter: (lower("substring"((a)::text, 1, 20)) ~~ 'abcde%'::text) Total runtime: 4839.551 ms`. The actual query takes close to 4 seconds as well. Perhaps I misunderstood your instructions. – punkish Feb 12 '12 at 22:50
  • @punkish: Ah, I forgot the operator class. But you found out from my link I assume. Amended my answer. – Erwin Brandstetter Feb 13 '12 at 20:17