Query performance in PostgreSQL using 'similar to'

Question

I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:

select *
from tableName 
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')

So if columnX contains at least one of the values specified (A, B, C, 1, 2, 3), I will keep the row.

I can't find a better approach than using similar to. The problem is that the query takes too long for a table with more than a million rows.

I've tried indexing it:

create index tableName_columnX_idx on tableName (columnX) 
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')

However, if the condition is variable (the values could be other than A, B, C, 1, 2, 3), I would need a different index for each condition.

Is there any better solution for this problem?

EDIT: Thanks everybody for the feedback. Looks like I've achieved to this point maybe because of a design mistake (topic I've posted in a separated question).

are you always going to search over single-character values? Is a query like `'%ABC%|%BCD%'` possible? — Quassnoi, Nov 01 '12 at 19:01
@FedericoCristina effectively, `columnX` is a *set* represented by single-character codes in a `text` field. Correct? This implies that the order of values in `columnX` does not matter. — Craig Ringer, Nov 02 '12 at 00:06

score 4 · Answer 1 · answered Nov 01 '12 at 19:13

4

If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:

CREATE INDEX
        ix_tablename_columnxlist
ON      tableName
USING   GIN((REGEXP_SPLIT_TO_ARRAY(columnX, '')))

then search against the index:

SELECT  *
FROM    tableName
WHERE   REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3']

answered Nov 01 '12 at 19:13

Quassnoi

413,100
91
616
614

1

A GIN index can be quite expensive to maintain, so it isn't particularly suitable for tables with high rates of inserts/updats/deletes. This may or may not be a problem for this particular application. – Craig Ringer Nov 01 '12 at 23:53
1

Thanks for your answer. Unfortunately these tables are constantly changing (inserts & updated) – Federico Cristina Nov 02 '12 at 12:18
3

@FedericoCristina: not aware of your scenario's details, however, in most cases frequent full scans imply more load on the server than index updates (even `GIN`). Have you actually tried both solutions? Also if you consider a redesign and are able to install extensions, you could store your flags in a native `INTARRAY` and index it with much more update-friendly `GIST`. – Quassnoi Nov 02 '12 at 12:35

aib · Answer 2 · 2012-11-02T01:19:34.947

2

I'll post this as an answer because it may guide other people in the future: Why not have 6 columns, haveA, haveB ~ have3 and do a 6-part OR query? Or use a bitmask?

If there are too many attributes to assign a column each, I might try creating an "attribute" table:

(fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3')

and let the DBMS worry about the optimization.

edited Nov 02 '12 at 01:19

answered Nov 01 '12 at 19:07

aib

45,516
10
73
79

1

Thanks for your answer. Maybe I should have explained better, those values (A, B, C, 1, 2, 3) where just for the example. The complete set of options is about 24 different ones. – Federico Cristina Nov 01 '12 at 19:10
1

I'm not sure the last bit is right. If `columnX` has value `ZAB` and you're searching for `A|B` that won't match, but should. – Craig Ringer Nov 02 '12 at 00:08
1

oh yeah, sorry. Attribute it to ~40 hours of sleeplessness please :) – aib Nov 02 '12 at 01:19

Erwin Brandstetter · Accepted Answer · 2023-08-01T19:11:45.947

I agree with Quassnoi, a GIN index is fastest and simplest - unless write performance or disk space are issues, because it occupies a lot of space and adds cost for INSERT, UPDATE and DELETE.

My additional answer is triggered by your statement:

I can't find a better approach than using similar to.

If that is what you found, then your search isn't over, yet. SIMILAR TO is a complete waste of time. Literally. Postgres only includes it to comply to the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.

Internally every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you, if you are not sure. You won't find this in the manual, PostgreSQL does not promise to do it this way, but I have yet to see an exception.

Further reading:

Thanks for the link! You're right about using SIMILAR TO. Maybe I should redesing the solution in order to avoid this scenario. I've edited my question with a reference to a new domain-specific question. — Federico Cristina, Nov 02 '12 at 12:15

score 2 · Answer 4 · edited May 23 '17 at 12:09

This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.

If so, I'd want to remodel this table to use one of the following approaches:

Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change you could always keep columnX and maintain the side table using triggers.
Convert columnX to a hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore of columnX should provide fairly solid performance for those operations.
Split to an array as recommended by Quassnoi if your table change rate is low and you can pay the costs of the GIN index;
Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.

Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.

Looks like you're right about the modelling issue. I should do some refactor there. I've edited my question with a reference to a new domain-specific question. — Federico Cristina, Nov 02 '12 at 12:13

Query performance in PostgreSQL using 'similar to'

4 Answers4

Linked