3

I have a keyword table containing a few millions entries. It is linked by a many to many relationship to an element table. I would like to get all the element ids matching a keyword. I tried this query and there is no problem, it returns rows in few milliseconds.

SELECT element_id FROM element_keyword 
JOIN keyword ON keyword.id = element_keyword.keyword_id
WHERE keyword.value like 'LOREM';   

Execution plan

"Nested Loop  (cost=278.50..53665.56 rows=65 width=4)"
"  ->  Index Scan using keyword_value_index on keyword  (cost=0.43..8.45 rows=1 width=4)"
"        Index Cond: ((value)::text = 'LOREM'::text)"
"        Filter: ((value)::text ~~ 'LOREM'::text)"
"  ->  Bitmap Heap Scan on element_keyword  (cost=278.07..53510.66 rows=14645 width=8)"
"        Recheck Cond: (keyword_id = keyword.id)"
"        ->  Bitmap Index Scan on element_keyword_keyword_index  (cost=0.00..274.41 rows=14645 width=0)"
"              Index Cond: (keyword_id = keyword.id)"

However when i put a wild card at the end of my search string the request becomes really slow. (~60000ms)

SELECT element_id FROM element_keyword 
JOIN keyword ON keyword.id = element_keyword.keyword_id
WHERE keyword.value like 'LOREM%';          

Execution plan:

"Hash Join  (cost=12.20..3733738.08 rows=19502 width=4)"
"  Hash Cond: (element_keyword.keyword_id = keyword.id)"
"  ->  Seq Scan on element_keyword  (cost=0.00..3002628.08 rows=194907408 width=8)"
"  ->  Hash  (cost=8.45..8.45 rows=300 width=4)"
"        ->  Index Scan using keyword_value_index on keyword  (cost=0.43..8.45 rows=300 width=4)"
"              Index Cond: (((value)::text ~>=~ 'LOREM'::text) AND ((value)::text ~<~ 'LOREN'::text))"
"              Filter: ((value)::text ~~ 'LOREM%'::text)"

Even when the wildcard does not give more result, the query is slow.

I created an index on keyword(value) and element_keyword(keyword_id)

CREATE INDEX "keyword_value_index" ON "keyword" (value text_pattern_ops);
CREATE INDEX "element_keyword_keyword_index" ON "element_keyword" (keyword_id);

What's really happening behind ? How could i solve it ?

UPDATE

I'm not sure if this could help but here's a few more tests :

select id from keyword where value like 'LOREM%'; 
-> 6 ids retrieved in 17ms 

select * from element_keyword where keyword_id in (1961746,1961710,2724258,2121442,1633163,1026116); 
-> 40 rows retrieved in 17ms

select * from element_keyword where keyword_id in (select id from keyword where value like 'LOREM%');
-> 40 rows in 63221 ms
streem
  • 9,044
  • 5
  • 30
  • 41
  • Can you also show the Execution plan for this query? – M.Ali Jun 11 '14 at 20:17
  • 5
    well to start with if you are designing your database so that you need to do wildcard searches, then you need full text search not wildcards. Wildcards often preclude the use of indexes and are very poor performers. – HLGEM Jun 11 '14 at 20:23
  • @HLGEM well yeah i'm probably going to go with full text search, but after some tests i had approximately the same results, except now... – streem Jun 11 '14 at 20:29
  • or you might be happier with a nosql database – HLGEM Jun 11 '14 at 20:30
  • Is the query slow if you don't join with the other table ? – fred Jun 11 '14 at 20:31
  • @fred Nope if you mean something like `select * from keyword where value like 'LOREM%';` it's a few milliseconds. – streem Jun 11 '14 at 20:32
  • Don't forget that when joining you want to have an index on the joined columns. You say you have an index on ELEMENT_KEYWORD(KEYWORD_ID), but do you have an index on KEYWORD(ID) ? But based on the execution plan, you are checking 194M rows -- so I don't think you are using the ELEMENT_KEYWORD index at all. – MJB Jun 11 '14 at 21:03
  • @MJB I have a primary key on keyword(id). Yes i believe you're right, slowness is probably because it's checking the 194M rows.I believe it should scan the element_keyword index before like the first request, but it seems to do it after. – streem Jun 11 '14 at 21:13
  • 1
    Do you have any index on the table keyword(value, id)? – fred Jun 11 '14 at 21:19
  • @fred If you mean a multicolumn index, no. I have an index on keyword(value) and a primary key on keyword(id). – streem Jun 11 '14 at 21:21
  • 1
    I don't know postgree but in SQL server it's better to include the id column in the index. – fred Jun 11 '14 at 21:27
  • as a troubleshooting step, what if you issue `set enable_seqscan to off;` before executing the second query? Does it use the index then? – maniek Jun 13 '14 at 10:51
  • @maniek Wow, it seems to work really well. If you could explain a bit more in an answer i'll definitely accept it. – streem Jun 14 '14 at 09:26

3 Answers3

2

The reason is this:

WHERE keyword.value like 'LOREM'

is a pointless use case for the LIKE operator. Without wildcards (or escaped characters) this is effectively the same as:

WHERE keyword.value = 'LOREM'

.. which can use an index for equality - thus a plain B-Tree index.

If you are satisfied with matching leading characters (left anchored search pattern), a B-Tree index with the the operator class text_pattern_ops or with COLLATE "C" would serve well. See:

For arbitrary pattern matching use the pg_trgm module:

Full text search may or may not be what you want. It is based on dictionaries and stemming, not so much on text patterns like your example suggests.

Also, a multicolumn index on keyword (value,id) (order of columns is relevant) could enable an index-only scan:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Well, i agree with the first part, it was more to bring out the difference. I'm satisfied with a left anchored search pattern and i already have text_pattern_ops index. I've updated my post to show how i created my index. – streem Jun 12 '14 at 17:32
1

I believe you need a *_pattern_ops index (see http://www.postgresql.org/docs/9.3/static/indexes-opclass.html):

create index pattern_idx on keyword(value text_pattern_ops)
maniek
  • 7,087
  • 2
  • 20
  • 43
  • I already have this index on value. I believe if i did not, a request like `select * from keyword where value like 'LOREM%'; `would not use the index. However it does. – streem Jun 12 '14 at 17:34
0

Basically what happens is that the second query is doing a sequential scan (for some reasons i don't understand). This sequential scan is what's taking so long.

Disabling the sequential scan forces the query use the index. So if i execute this line before my query it becomes really quick.

set enable_seqscan to off;
streem
  • 9,044
  • 5
  • 30
  • 41