pg_trgm how to give higher similarity score when only accents vary

Question

pg_trgm gives me a score of 0.4 for both of these comparisons:

SELECT similarity('Noemie','Noémie');
0.4 

SELECT similarity('Noemie','NoXmie');
0.4

Obviously the first pair is more "similar" than the second. Accents are often omitted in data entry, so it would be quite useful to have a score that gives high similarity to words that differ only by the presence or absence of an accent.

Is there a way to tweak pg_trgm to give a higher similarity score to words that differ only by accents?

score 3 · Accepted Answer · edited May 23 '17 at 12:32

I would start by suggesting that you strip the accents from the strings before comparing them. Postgres offers a function to do this, unaccent(), but it belongs to a separate extension that you need to install. Here is information on the topic.
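As a rough sketch of the setup, assuming the contrib modules are available on the server, the database encoding is UTF-8, and you have the privileges to create extensions in the current database, enabling both extensions typically looks like this:

```sql
-- Enable accent stripping and trigram matching in the current database.
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Quick check: with the default unaccent rules, the accent is dropped.
SELECT unaccent('Noémie');  -- Noemie
```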

With this function (or a similar function), you could do:

SELECT similarity(unaccent('Noemie'), unaccent('Noémie'));
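To see why a single accent costs so much in the raw score, it can help to look at the trigrams themselves. The sketch below uses pg_trgm's show_trgm() and assumes a UTF-8 database; the column and subquery aliases are only illustrative.

```sql
-- Number of distinct trigrams pg_trgm extracts from each spelling.
SELECT array_length(show_trgm('Noemie'), 1) AS plain_count,
       array_length(show_trgm('Noémie'), 1) AS accented_count;

-- Number of trigrams the two spellings have in common.
SELECT count(*) AS shared
FROM (
    SELECT unnest(show_trgm('Noemie'))
    INTERSECT
    SELECT unnest(show_trgm('Noémie'))
) AS common;
```

Each spelling should yield 7 trigrams, of which 4 are shared, because the accented letter appears in three of them. That gives 4 / (7 + 7 - 4) = 0.4, the score reported in the question, and exactly the same arithmetic applies to 'NoXmie'. After unaccent() the two strings are identical, so the similarity becomes 1.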

Treating the accented and unaccented spellings as identical might be going too far. A weighted average of the two scores might be more appropriate:

SELECT (alpha * similarity(unaccent('Noemie'), unaccent('Noémie')) +
        (1 - alpha) * similarity('Noemie', 'Noémie')
       );

alpha is a value between 0 and 1 that you choose: it is the weight given to the accent-insensitive score. With alpha = 1 accents are ignored entirely; with alpha = 0 you get the plain pg_trgm score.
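If the weighted score is needed in several queries, one option is to wrap it in a small SQL function. This is only a sketch: the function name accent_aware_similarity and the default alpha of 0.8 are made up for illustration, and it assumes both pg_trgm and unaccent are installed.

```sql
-- Hypothetical helper: name and default weight are illustrative, not part of pg_trgm.
CREATE OR REPLACE FUNCTION accent_aware_similarity(
    a     text,
    b     text,
    alpha double precision DEFAULT 0.8   -- weight of the accent-insensitive score
) RETURNS double precision
LANGUAGE sql
STABLE   -- unaccent() is STABLE, so the wrapper cannot be IMMUTABLE
AS $$
    SELECT alpha * similarity(unaccent(a), unaccent(b))
         + (1 - alpha) * similarity(a, b);
$$;

-- Accent-only difference: roughly 0.8 * 1 + 0.2 * 0.4 = 0.88
SELECT accent_aware_similarity('Noemie', 'Noémie');

-- Genuinely different letter: both terms are 0.4, so the result stays near 0.4
SELECT accent_aware_similarity('Noemie', 'NoXmie');
```

With this weighting the accent-only pair scores well above the 'NoXmie' pair, yet still below an exact match, which is the behaviour asked for in the question.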

Here is a good discussion of this issue.
