Slow join query with wildcard

Question

I have a keyword table containing a few millions entries. It is linked by a many to many relationship to an element table. I would like to get all the element ids matching a keyword. I tried this query and there is no problem, it returns rows in few milliseconds.

SELECT element_id FROM element_keyword 
JOIN keyword ON keyword.id = element_keyword.keyword_id
WHERE keyword.value like 'LOREM';

Execution plan

"Nested Loop  (cost=278.50..53665.56 rows=65 width=4)"
"  ->  Index Scan using keyword_value_index on keyword  (cost=0.43..8.45 rows=1 width=4)"
"        Index Cond: ((value)::text = 'LOREM'::text)"
"        Filter: ((value)::text ~~ 'LOREM'::text)"
"  ->  Bitmap Heap Scan on element_keyword  (cost=278.07..53510.66 rows=14645 width=8)"
"        Recheck Cond: (keyword_id = keyword.id)"
"        ->  Bitmap Index Scan on element_keyword_keyword_index  (cost=0.00..274.41 rows=14645 width=0)"
"              Index Cond: (keyword_id = keyword.id)"

However when i put a wild card at the end of my search string the request becomes really slow. (~60000ms)

SELECT element_id FROM element_keyword 
JOIN keyword ON keyword.id = element_keyword.keyword_id
WHERE keyword.value like 'LOREM%';

Execution plan:

"Hash Join  (cost=12.20..3733738.08 rows=19502 width=4)"
"  Hash Cond: (element_keyword.keyword_id = keyword.id)"
"  ->  Seq Scan on element_keyword  (cost=0.00..3002628.08 rows=194907408 width=8)"
"  ->  Hash  (cost=8.45..8.45 rows=300 width=4)"
"        ->  Index Scan using keyword_value_index on keyword  (cost=0.43..8.45 rows=300 width=4)"
"              Index Cond: (((value)::text ~>=~ 'LOREM'::text) AND ((value)::text ~<~ 'LOREN'::text))"
"              Filter: ((value)::text ~~ 'LOREM%'::text)"

Even when the wildcard does not give more result, the query is slow.

I created an index on keyword(value) and element_keyword(keyword_id)

CREATE INDEX "keyword_value_index" ON "keyword" (value text_pattern_ops);
CREATE INDEX "element_keyword_keyword_index" ON "element_keyword" (keyword_id);

What's really happening behind ? How could i solve it ?

UPDATE

I'm not sure if this could help but here's a few more tests :

select id from keyword where value like 'LOREM%'; 
-> 6 ids retrieved in 17ms 

select * from element_keyword where keyword_id in (1961746,1961710,2724258,2121442,1633163,1026116); 
-> 40 rows retrieved in 17ms

select * from element_keyword where keyword_id in (select id from keyword where value like 'LOREM%');
-> 40 rows in 63221 ms

well to start with if you are designing your database so that you need to do wildcard searches, then you need full text search not wildcards. Wildcards often preclude the use of indexes and are very poor performers. — HLGEM, Jun 11 '14 at 20:23
@HLGEM well yeah i'm probably going to go with full text search, but after some tests i had approximately the same results, except now... — streem, Jun 11 '14 at 20:29
@fred Nope if you mean something like `select * from keyword where value like 'LOREM%';` it's a few milliseconds. — streem, Jun 11 '14 at 20:32
Don't forget that when joining you want to have an index on the joined columns. You say you have an index on ELEMENT_KEYWORD(KEYWORD_ID), but do you have an index on KEYWORD(ID) ? But based on the execution plan, you are checking 194M rows -- so I don't think you are using the ELEMENT_KEYWORD index at all. — MJB, Jun 11 '14 at 21:03
@MJB I have a primary key on keyword(id). Yes i believe you're right, slowness is probably because it's checking the 194M rows.I believe it should scan the element_keyword index before like the first request, but it seems to do it after. — streem, Jun 11 '14 at 21:13
@fred If you mean a multicolumn index, no. I have an index on keyword(value) and a primary key on keyword(id). — streem, Jun 11 '14 at 21:21
I don't know postgree but in SQL server it's better to include the id column in the index. — fred, Jun 11 '14 at 21:27
as a troubleshooting step, what if you issue `set enable_seqscan to off;` before executing the second query? Does it use the index then? — maniek, Jun 13 '14 at 10:51
@maniek Wow, it seems to work really well. If you could explain a bit more in an answer i'll definitely accept it. — streem, Jun 14 '14 at 09:26

Erwin Brandstetter · Answer 1 · 2023-08-01T19:03:58.283

The reason is this:

WHERE keyword.value like 'LOREM'

is a pointless use case for the LIKE operator. Without wildcards (or escaped characters) this is effectively the same as:

WHERE keyword.value = 'LOREM'

.. which can use an index for equality - thus a plain B-Tree index.

If you are satisfied with matching leading characters (left anchored search pattern), a B-Tree index with the the operator class text_pattern_ops or with COLLATE "C" would serve well. See:

Full-text search in Postgres or CouchDB?

For arbitrary pattern matching use the pg_trgm module:

PostgreSQL LIKE query performance variations

Full text search may or may not be what you want. It is based on dictionaries and stemming, not so much on text patterns like your example suggests.

Overview over pattern matching in Postgres.

Also, a multicolumn index on keyword (value,id) (order of columns is relevant) could enable an index-only scan:

Well, i agree with the first part, it was more to bring out the difference. I'm satisfied with a left anchored search pattern and i already have text_pattern_ops index. I've updated my post to show how i created my index. — streem, Jun 12 '14 at 17:32

score 1 · Answer 2 · answered Jun 11 '14 at 21:57

1

I believe you need a *_pattern_ops index (see http://www.postgresql.org/docs/9.3/static/indexes-opclass.html):

create index pattern_idx on keyword(value text_pattern_ops)

answered Jun 11 '14 at 21:57

maniek

7,087
2
20
43

I already have this index on value. I believe if i did not, a request like `select * from keyword where value like 'LOREM%'; `would not use the index. However it does. – streem Jun 12 '14 at 17:34

score 0 · Accepted Answer · answered Jun 20 '14 at 13:30

Basically what happens is that the second query is doing a sequential scan (for some reasons i don't understand). This sequential scan is what's taking so long.

Disabling the sequential scan forces the query use the index. So if i execute this line before my query it becomes really quick.

set enable_seqscan to off;

Slow join query with wildcard

3 Answers3