
I have a CLIENTS_WORDS table with the columns ID, CLIENT_ID, WORD in a PostgreSQL database:

ID|CLIENT_ID|WORD
1 |1242     |word1
2 |1242     |WordX.foo
3 |1372     |nextword
4 |1999     |word1
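
A minimal schema for this table (column types are assumed) would be:

CREATE TABLE clients_words (
  id        serial PRIMARY KEY,
  client_id integer NOT NULL,
  word      text    NOT NULL
);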

The table may hold roughly 100k-500k rows.
I have query strings like these:

'Some people tell word1 to someone'
'Another stringWordX.foo too possible'

I want to select all rows where the text in the WORD column is contained in the query string. Currently I use:

select * from CLIENTS_WORDS
where strpos('Some people tell word1 to someone', WORD) > 0

My question: what is the fastest / best-performing way to retrieve the matched rows?

Dmitry

1 Answer


You get better performance with unnest() and a JOIN. Like this:

SELECT DISTINCT c.client_id
FROM   unnest(string_to_array('Some people tell word1 ...', ' ')) AS t(word)
JOIN   clients_words c USING (word);

Details of the query depend on missing details of your requirements. This one splits the string at space characters.
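
If you need whole rows (your select *) rather than distinct client IDs, a variant of the same idea, assuming exact matches on word, might be:

SELECT c.*
FROM   unnest(string_to_array('Some people tell word1 to someone', ' ')) AS t(word)
JOIN   clients_words c USING (word);

Note that a word occurring twice in the string produces duplicate rows; add DISTINCT if that matters.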

A more flexible tool would be regexp_split_to_table(), where you can use character classes or shorthands for your delimiter characters. Like:

regexp_split_to_table('Some people tell word1 to someone', '\s') AS t(word)
regexp_split_to_table('Some people tell word1 to someone', '\W') AS t(word)
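
For example, a complete query splitting on runs of whitespace (the \s+ pattern is my choice here; adapt it to your actual delimiters) could be:

SELECT DISTINCT c.client_id
FROM   regexp_split_to_table('Some people tell word1 to someone', '\s+') AS t(word)
JOIN   clients_words c USING (word);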

Of course the column clients_words.word needs to be indexed for performance:

CREATE INDEX clients_words_word_idx ON clients_words (word);

This would be very fast.
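
If matching should be case-insensitive (the question does not say; this is an assumption), an expression index plus lowercasing both sides would be a sketch:

CREATE INDEX clients_words_word_lower_idx ON clients_words (lower(word));

SELECT DISTINCT c.client_id
FROM   unnest(string_to_array(lower('Some people tell word1 to someone'), ' ')) AS t(word)
JOIN   clients_words c ON lower(c.word) = t.word;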

Ignore word boundaries

If you want to ignore word boundaries altogether, the whole matter becomes much more expensive. LIKE / ILIKE in combination with a trigram GIN index would come to mind.

However, your case is backwards: the constant is the long string and the pattern is the table column, so such an index is not going to help. You would have to inspect every single row for a partial match, making queries very expensive. The superior approach is to reverse the operation: split the string into words first, then search.
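
To illustrate the difference (a sketch, not a recommendation for this case): a trigram index accelerates LIKE with a constant pattern against the indexed column, which is the opposite of the situation here.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX clients_words_word_trgm_idx ON clients_words USING gin (word gin_trgm_ops);

-- the index can help here (constant pattern, indexed column):
SELECT * FROM clients_words WHERE word LIKE '%word1%';

-- but not here (the roles are reversed, every row must be checked):
SELECT * FROM clients_words WHERE strpos('Some people tell word1 to someone', word) > 0;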

Erwin Brandstetter