2

I would like to know what kind of similarity function is used in the case of PostgreSQL pg_trgm extension. My initial assumption was that it computes the similarity of two strings s1 and s2 using the following formula:

sim(s1, s2) = |G3(s1) ⋂ G3(s2)| / max(|G(s1)|, |G(s2)|)

where G3 is a set of 3-grams for a string. I tried several examples and it seems that the computation is somehow different in PostgreSQL.

create extension pg_trgm;

create table doc (
    word text
);

insert into doc values ('bbcbb');

select *, similarity(word, 'bcb') from doc;

The above example returns 0.25. However,

G3('bbcbb') = {##b, #bb, bbc, bcb, cbb, bb#, b##}
G3('bcb') = {##b, #bc, bcb, cb#, b##}
|G3(s1) ⋂ G3(s2)| = 3
max(|G(s1)|, |G(s2)|) = 7

therefore the sim formula does not return 0.25. What is the correct formula?

Radim Bača
  • 10,646
  • 1
  • 19
  • 33
  • Why don't you just check [the source code](https://github.com/postgres/postgres/tree/master/contrib/pg_trgm)? –  Aug 23 '21 at 07:19

1 Answers1

3

I think u need to check this: PostgreSQL, trigrams and similarity

Summ:

SELECT show_trgm('bbcbb'), show_trgm('bcb')

'bbcbb' = {" b"," bb","bb ",bbc,bcb,cbb}
'bcb' = {" b"," bc",bcb,"cb "}

'bbcbb' -> 6 element 'bcb' -> 4 element

SELECT UNNEST(show_trgm('bbcbb')) INTERSECT SELECT UNNEST(show_trgm('bcb'))

Result: 2 elements are common

SELECT UNNEST(show_trgm('bbcbb')) UNION SELECT UNNEST(show_trgm('bcb'))

Result: 8 elements

Similarity 2/8 = 0.25

UPDATE: https://www.postgresql.org/docs/13/pgtrgm.html

In the F.31.1. Trigram (or Trigraph) Concepts section

There is a note says:

"...Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string..."

  • Ok, it makes sense. I do not understand why `ps_trgm` does not generate the last trigram, but now the computation is correct. Thanks. – Radim Bača Aug 23 '21 at 11:30
  • 1
    The formula from the source is `count / (len1 + len2 - count)`, where `count` is the number of common trigrams and `len1` and `len2` are the number of trigrams for the strings. That is exactly the same as what you say. – Laurenz Albe Aug 24 '21 at 14:03
  • @LaurenzAlbe Thanks that makes the computation even more clear – Radim Bača Aug 25 '21 at 13:14