I would like to know which similarity function the PostgreSQL pg_trgm
extension uses. My initial assumption was that it computes the similarity of two strings s1 and s2 using the following formula:
sim(s1, s2) = |G3(s1) ⋂ G3(s2)| / max(|G3(s1)|, |G3(s2)|)
where G3(s) is the set of 3-grams of the string s. I tried several examples, and PostgreSQL seems to compute the similarity differently. For example:
create extension pg_trgm;
create table doc (
    word text
);
insert into doc values ('bbcbb');
select *, similarity(word, 'bcb') from doc;
The query above returns 0.25. However, by my computation (with # as the padding character):
G3('bbcbb') = {##b, #bb, bbc, bcb, cbb, bb#, b##}
G3('bcb') = {##b, #bc, bcb, cb#, b##}
|G3(s1) ⋂ G3(s2)| = 3
max(|G3(s1)|, |G3(s2)|) = 7
Therefore my formula gives 3/7 ≈ 0.43, not 0.25. What formula does pg_trgm actually use?
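
For reference, here is a minimal Python sketch that reproduces my hand computation. It only implements the formula I assumed, not whatever pg_trgm does internally. The helper names `trigrams` and `assumed_sim` are hypothetical, and padding with two `#` characters on each side is my own assumption.

```python
def trigrams(s, pad="#"):
    """Return the set of 3-grams of s, padded with two pad chars on each side.

    The padding scheme is an assumption made for this question, not
    necessarily the one pg_trgm uses.
    """
    padded = pad * 2 + s + pad * 2
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def assumed_sim(s1, s2):
    """Assumed formula: |G3(s1) ∩ G3(s2)| / max(|G3(s1)|, |G3(s2)|)."""
    g1, g2 = trigrams(s1), trigrams(s2)
    return len(g1 & g2) / max(len(g1), len(g2))


if __name__ == "__main__":
    print(sorted(trigrams("bbcbb")))     # 7 trigrams
    print(sorted(trigrams("bcb")))       # 5 trigrams
    print(assumed_sim("bbcbb", "bcb"))   # 3/7, which is not the 0.25 PostgreSQL reports
```

This confirms that the arithmetic for the assumed formula is right, so the mismatch has to come from the formula or the padding.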