PostgreSQL levenshtein and precomposed character vs. combined character

Question

I have strings containing two similar-looking characters. Both appear as a small 'a' with an ogonek:

ą

ą

(Note: depending on the renderer, they are sometimes rendered identically and sometimes slightly differently.)

However, they are different:

Characteristics of the 1st character:

In PostgreSQL:

select ascii('ą');
ascii 
-------
261

The UTF-8-encoding in Hex is: \xC4\x85

so it is a precomposed character (https://en.wikipedia.org/wiki/Precomposed_character)

Characteristics of the 2nd character:

In PostgreSQL:

select ascii('ą');
ascii 
-------
97

(same as character 'a')

That strongly indicates that the rendered character is composed of two characters, and indeed it is:

The UTF-8-encoding in Hex is: \x61\xCC\xA8
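The same check can be made outside the database. The following is a minimal Python sketch (Python is not part of the original question; the variable names are illustrative) that lists the code points and UTF-8 bytes of both forms:

```python
import unicodedata

precomposed = "\u0105"  # one code point: LATIN SMALL LETTER A WITH OGONEK
combined = "a\u0328"    # two code points: 'a' followed by COMBINING OGONEK

for label, s in (("precomposed", precomposed), ("combined", combined)):
    code_points = [ord(ch) for ch in s]             # what ascii() reports for the first one
    names = [unicodedata.name(ch) for ch in s]      # official Unicode character names
    utf8_hex = s.encode("utf-8").hex()              # UTF-8 bytes as hex digits
    print(label, code_points, utf8_hex, names)
```

The first string consists of the single code point 261, while the second starts with code point 97, which matches what ascii() reports above.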

So it is a combination of

a \x61

and a combining character (https://en.wikipedia.org/wiki/Combining_character), the separate ogonek:

̨ \xCC\xA8

I want to use PostgreSQL's levenshtein function to determine the similarity of words, so I want to treat both characters as the same (which is of course what people intend when they write the name of a distinct entity with either the 1st or the 2nd character).

I assumed that I could use unaccent to always get rid of the ogonek, but that does not work in the 2nd case:

1st character: expected result:

select levenshtein('ą', 'x');
levenshtein 
-------------
       1

1st character: expected result:

select levenshtein(unaccent('ą'), 'x');
levenshtein 
-------------
       1

2nd character: expected result:

select levenshtein('ą', 'x');
levenshtein 
-------------
       2

2nd character: unexpected result:

select levenshtein(unaccent('ą'), 'x');
levenshtein 
-------------
       2

So, when I compare both characters with levenshtein and unaccent, the result is 1:

select levenshtein(unaccent('ą'), unaccent('ą'));
levenshtein 
-------------
       1

instead of 0.

How can I "get rid of the ogonek" in the 2nd case?

(How) can I use the UTF-8 codes of strings to get the desired result?

Edit: As @s-man suggested, adding the combining character to unaccent.rules would solve this particular problem. But to generally solve the precomposed character vs. combined character problem with unaccent, I would have to explicitly add/modify every missing/"misconfigured" combined character to/in the config.

Maybe you should check your configuration: https://www.postgresql.org/docs/current/unaccent.html Maybe both characters have different configs... — S-Man, Jun 20 '19 at 09:55
Yes, the combining character is missing in the config, so adding it would solve this particular problem. But to generally solve the _precomposed character vs. combining character_ problem this way, I would have to explicitly add every (missing) combining character to the config. — Johann Gottfried, Jun 20 '19 at 10:16

score 3 · Accepted Answer · answered Jun 20 '19 at 14:40

Removing accents will give you a Levenshtein distance of 0, but it will also give you a distance of 0 between ą and a, which does not sound ideal.

The better solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence E'a\u0328' into the precomposed character E'\u0105' before comparing them.

Unfortunately, Postgres versions before 13 don't have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.

For example:

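create extension plpythonu;

create or replace function unicode_normalize(str text) returns text as $$
  # plpythonu runs Python 2, where text arguments arrive as byte strings,
  # so decode them (assuming a UTF-8 database) before composing to NFC.
  import unicodedata
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;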

And then:

test=# select levenshtein(unicode_normalize(E'a\u0328'), unicode_normalize(E'\u0105'));
 levenshtein
-------------
           0

This also solves the issue in your previous question, where the combining character was contributing to the Levenshtein distance:

test=# select levenshtein(unicode_normalize(E'a\u0328'), 'x');
 levenshtein
-------------
           1
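To see the effect of normalisation without a database, here is a minimal pure-Python sketch. It assumes a textbook dynamic-programming edit distance that counts code points (the `levenshtein` and `nfc` helpers below are illustrative, not the fuzzystrmatch implementation):

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance in code points; insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def nfc(s: str) -> str:
    """Compose base character + combining mark into the precomposed form."""
    return unicodedata.normalize("NFC", s)


combined, precomposed = "a\u0328", "\u0105"
print(levenshtein(combined, precomposed))            # 2 without normalisation
print(levenshtein(nfc(combined), nfc(precomposed)))  # 0
print(levenshtein(nfc(combined), "x"))               # 1
```

Unlike stripping accents, NFC keeps the ogonek, so the distance between the normalised character and a plain 'a' remains 1.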

Thank you! I used your answer to answer my previous, related question. — Johann Gottfried, Jun 21 '19 at 19:44
[Postgres now includes a `normalize` function (as of version 13)](https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.1). — robjwells, Mar 07 '21 at 12:41

score 2 · Answer 2 · answered Jun 20 '19 at 12:02

You have to change your configuration and manually add the missing characters to the config file, as described in https://postgresql.org/docs/current/unaccent.html

S-Man


Thank you (esp. for your comment above). I upvoted the answer. – Johann Gottfried Jun 20 '19 at 12:08
But as I wrote in the reply to your comment above and in my edit, I am aware that this would solve the problem. However, it is doubtful that I would be able (and willing) to add every missing character to the config. So hopefully there is an alternative solution. – Johann Gottfried Jun 20 '19 at 12:10
I was just thinking: Maybe there is a comprehensive list useable for unaccent.rules that maps combining characters and corresponding precomposed characters to the same target characters. – Johann Gottfried Jun 20 '19 at 12:15
Yes, maybe such a list can be found somewhere. But as for creating your own: how many characters are we talking about? 100? 1000? Hm, there are only a few "modifiers" (accents, umlauts), maybe 10? Then there are only a few letters that can be modified: AEUOILSCGZ... Maybe creating it manually would not take too long. Am I wrong? Plus some ligatures like ß or AE... – S-Man Jun 20 '19 at 12:37
You don't have to consider every combination of "modifiers" and "letters", it is sufficient to know the "modifiers": see my answer. Thanks again! – Johann Gottfried Jun 20 '19 at 13:51

score 1 · Answer 3 · edited Jun 20 '20 at 09:12

Note: This solution is based on @S-Man's suggestion to explicitly add missing characters to the unaccent.rules file.

Note: A prerequisite of this answer is that the relevant precomposed characters (https://en.wikipedia.org/wiki/Precomposed_character) are already mapped in the unaccent.rules file. If not, they have to be added as well.

There are characters which are composed of multiple characters:

a "basic" character (e.g. vowels like a, consonants like l)
a combining character (https://en.wikipedia.org/wiki/Combining_character), typically one diacritic like acute ( ´ ) or dot ( · )

The goal is to map a "multiple character" character to the "basic" character it contains.

(assuming the corresponding precomposed characters are mapped to the "basic" character, which is the case in the original unaccent.rules file)

unaccent checks every character in a "multiple character" character for replacement, so it is not necessary to consider every combination of basic character and diacritic.

Instead, the diacritics have to be mapped to [nothing]. This can be achieved by leaving the second column in the unaccent.rules file (https://postgresql.org/docs/current/unaccent.html) empty.

This is a list of diacritics for the Latin alphabet obtained from https://en.wikipedia.org/wiki/Diacritic: ´ ˝ ` ̏ ˆ ˇ ˘ ̑ ¸ ¨ · ̡ ̢ ̉ ̛ ͅ ˉ ˛ ͂ ˚ ˳ ῾ ᾿

Add to that the ogonek of the question, which is missing: ̨

Now (after a PostgreSQL restart, of course), unaccent maps "multiple character" characters to the "basic" character, as it does with precomposed characters.

Note: The above list may not be comprehensive, but should at least solve a good part of the "precomposed character vs. combined character" problem.
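The idea of mapping every diacritic to nothing can also be sketched outside unaccent. The following Python snippet is only an illustration under stated assumptions (the `strip_combining` helper is hypothetical and not part of unaccent): it decomposes each string to NFD and drops every character whose canonical combining class is non-zero, so no hand-maintained list of diacritics is needed.

```python
import unicodedata


def strip_combining(s: str) -> str:
    """Decompose to NFD, then drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_combining("\u0105"))   # precomposed a with ogonek
print(strip_combining("a\u0328"))  # 'a' followed by combining ogonek
```

Both calls yield a plain 'a'. Spacing diacritics such as ´ and letters without a canonical decomposition such as ł are left unchanged by this sketch, so it complements the rules file rather than replacing it.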

Diacritics have been added recently to unaccent.rules. If you download `unaccent.rules` from PostgreSQL 12 beta 1 it should already have your suggestion in. It will not be backported to previous versions to keep unaccent() results stable within major versions. — Daniel Vérité, Jun 20 '19 at 14:02
