pg_trgm gives me a score of 0.4 for both of these comparisons:

SELECT similarity('Noemie','Noémie');
0.4 

SELECT similarity('Noemie','NoXmie');
0.4 

Obviously the first pair is more "similar" than the second. Accents are often omitted in data entry, so it would be quite useful to have a score that gives high similarity to words that differ only by the presence or absence of an accent.

Is there a way to tweak pg_trgm to give a higher similarity score to words that differ only by accents?

Max L.

1 Answer

I would start by suggesting that you remove the accents from your strings before comparing them. Postgres offers a function to do this, unaccent(), but it is provided by a separate extension (also named unaccent) that you need to install. Here is information on the topic.
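As a rough sketch of the setup step, assuming the contrib modules are available on the server and you have the privileges to create extensions in the current database, enabling both extensions might look like this:

```sql
-- Enable the extensions once per database (both ship with the
-- standard PostgreSQL contrib package; assumed to be installed).
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Quick check that accents are stripped with the default rules.
SELECT unaccent('Noémie');  -- Noemie
```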

With this function (or a similar function), you could do:

SELECT similarity(unaccent('Noemie'), unaccent('Noémie'));

Treating the two values the same might be going too far. A weighted average of the two might be more appropriate:

SELECT (alpha * similarity(unaccent('Noemie'), unaccent('Noémie')) +
        (1 - alpha) * similarity('Noemie', 'Noémie')
       );

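alpha would be a value between 0 and 1 that sets how much weight the accent-insensitive comparison gets: with alpha = 1 accents are ignored entirely, and with alpha = 0 you are back to the plain pg_trgm score.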
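If you want to reuse the weighted score, one option is to wrap it in a SQL function. The sketch below is only an illustration: the function name accent_aware_similarity and the default alpha of 0.5 are my own assumptions, not part of pg_trgm or unaccent, and it assumes both extensions are already enabled.

```sql
-- Hypothetical helper: blends the accent-insensitive and the plain
-- trigram similarity. The name and the default alpha are illustrative.
CREATE OR REPLACE FUNCTION accent_aware_similarity(
    a     text,
    b     text,
    alpha double precision DEFAULT 0.5
) RETURNS double precision
LANGUAGE sql
STABLE  -- unaccent() is STABLE, so this wrapper cannot be IMMUTABLE
AS $$
    SELECT alpha * similarity(unaccent(a), unaccent(b))
         + (1 - alpha) * similarity(a, b);
$$;

-- Usage: compare the two pairs from the question.
SELECT accent_aware_similarity('Noemie', 'Noémie') AS accent_only,
       accent_aware_similarity('Noemie', 'NoXmie') AS other_letter;
```

With alpha = 0.5 and the 0.4 scores from the question, the accent-only pair should come out at about 0.7 (the unaccented strings are identical, so that term is 1), while the 'NoXmie' pair stays at about 0.4 because unaccent() leaves the X untouched.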

Here is a good discussion of this issue.

Gordon Linoff