3

I have table full of strings (TEXT) and I like to get all the strings that are substrings of any other string in the same table. For example if I had these three strings in my table:

WORD        WORD_ID
cup         0
cake        1
cupcake     2

As result of my query I would like to get something like this:

WORD        WORD_ID        SUBSTRING        SUBSTRING_ID
cupcake     2              cup              0
cupcake     2              cake             1 

I know that I could do this with two loops (using Python or JS) by looping over every word in my table and match it against every word in the same table, but I'm not sure how this can be done using SQL (PostgreSQL for that matter).

halfer
  • 19,824
  • 17
  • 99
  • 186
Maximilian Lindsey
  • 809
  • 4
  • 18
  • 35
  • Looks like a join with condition using `word LIKE '%'+substring+'%'`. – shawnt00 Nov 13 '15 at 23:19
  • Thanks guys for your comments, which pointed me in the right direction. – Maximilian Lindsey Nov 13 '15 at 23:26
  • This query can be extremely expensive for big tables. Please clarify: is your title what you need (`check if a string is the substring of any other string in the same table`), or is the example what you need (list all matching combinations)? The semantic difference is subtle, but the difference in result and performance is potentially ***huge***. Either way, there are better solutions, yet. – Erwin Brandstetter Nov 14 '15 at 02:13
  • @shawnt00: the operator to concatenate strings is `||` in SQL, not `+` –  Nov 14 '15 at 08:20
  • @a_horse It's just a habit from all my years with SQL Server. I think the idea was still there. And actually I had noticed that Gordon even mixed the two in his own answer :) – shawnt00 Nov 14 '15 at 08:32

3 Answers3

4

Use self-join:

select w1.word, w1.word_id, w2.word, w2.word_id
from words w1
join words w2
on w1.word <> w2.word
and w1.word like format('%%%s%%', w2.word);

  word   | word_id | word | word_id 
---------+---------+------+---------
 cupcake |       2 | cup  |       0
 cupcake |       2 | cake |       1
(2 rows)
klin
  • 112,967
  • 15
  • 204
  • 232
1

Problem

The task has the potential to stall your database server for tables of non-trivial size, since it's an O(N²) problem as long as you cannot utilize an index for it.

In a sequential scan you have to check every possible combination of two rows, that's n * (n-1) / 2 combinations - Postgres will run n * n-1 tests since it's not easy to rule out reverse duplicate combinations. If you are satisfied with the first match, it gets cheaper - how much depends on data distribution. For many matches, Postgres will find a match for a row early and can skip testing the rest. For few matches, most of the checks have to be performed anyway.

Either way, performance deteriorates rapidly with the number of rows in the table. Test each query with EXPLAIN ANALYZE and 10, 100, 1000 etc. rows in the table to see for yourself.

Solution

Create a trigram index on word - preferably GIN.

CREATE INDEX tbl_word_trgm_gin_idx ON tbl USING gin (word gin_trgm_ops);

Details:

The queries in both answers so far wouldn't use the index even if you had it. Use a query that can actually work with this index:

To list all matches (according to the question body):
Use a LATERAL CROSS JOIN:

SELECT t2.word_id, t2.word, t1.word_id, t1.word
FROM   tbl t1
     , LATERAL (
   SELECT word_id, word
   FROM   tbl
   WHERE  word_id <> t1.word_id
   AND    word like format('%%%s%%', t1.word)
   ) t2;

To just get rows that have any match (according to your title): Use an EXISTS semi-join:

SELECT t1.word_id, t1.word
FROM   tbl t1
WHERE EXISTS (
   SELECT 1
   FROM   tbl
   WHERE  word_id <> t1.word_id
   AND    word like format('%%%s%%', t1.word)
   );
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • . . Cool, I didn't realize that the trigram index would work with `like`. For words with one or two characters, I'm not sure if the trigrams would help. – Gordon Linoff Nov 14 '15 at 12:33
  • @GordonLinoff: It still would - if an index makes sense to begin with. (One-letter strings like `'a'` are typically contained in a large percentage of rows, an index would be useless for this anyway, as you know). But generally, trigrams are build with a trick of prepending and appending space characters internally, so they work for just 1 or 2 letters as well. – Erwin Brandstetter Nov 14 '15 at 13:49
  • 1
    . . The issue is that "_ca" isn't going to match anything in "Cupcake". With three characters, there is a match. But if the word were "ca" the trigrams would be "__c", "_ca", and "ca_", none of which match anything in "Cupcake". By the way, I very much like the answer, the comment is only a clarification. It would be nice if pg_trgm supported n-grams, where "n" could be input by the user. – Gordon Linoff Nov 14 '15 at 13:56
  • @GordonLinoff: Why would "ca" not match "Cupcake"? It matches, and a trigram index can be used for this. (But Postgres will only chose a bitmap index scan over a sequential scan if the string is estimated to be selective enough. You can force it for debugging purposes with `set enable_seqscan = false`.) – Erwin Brandstetter Nov 15 '15 at 02:15
  • 1
    . . "Cupcake" has no trigram equal to "ca". It has "pca" and "cak", but not "_ca". Does the index handle partial matches? – Gordon Linoff Nov 15 '15 at 04:37
  • @GordonLinoff: You are right, I quote the manual: `For both LIKE and regular-expression searches, keep in mind that a pattern with no extractable trigrams will degenerate to a full-index scan.` The reason: trigrams with multibyte characters are replaced by a 3-byte hash that won't allow partial matching. It still works reliably in *any* case, the query plan can still report an index scan (or bitmap index scan more likely) and typically it's not going to be slower because most 1- or 2-letter patterns are common enough that a sequential scan is faster anyway or at least not much slower. – Erwin Brandstetter Nov 15 '15 at 19:09
  • . . Thank you for following up. I've implemented trigrams myself (even in databases) but I haven't really used them in Postgres. It is a powerful capability. Happily, the underlying code does the right thing even when there is no match. – Gordon Linoff Nov 15 '15 at 23:22
0

I would approach this as:

select w1.word_id, w1.word, w2.word_id as substring_id w2.word as substring
from words w1 join
     words w2
     on w1.word like '%' || w2.word || '%' and w1.word <> w2.word;

Note: this is probably a bit faster than doing the loop in the application. However, this query will be implemented as a nested loop in Postgres, so it won't be blazingly fast.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786