5

When I use the fuzzystrmatch levenshtein function with diacritic characters, it returns a wrong (multibyte-ignorant) result:

select levenshtein('ą', 'x');
 levenshtein
-------------
           2

(Note: the first character is an 'a' with a diacritic below it; it may not render properly after being copied here)
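For illustration, the copied character is actually two code points, which you can check with length() and octet_length() (assuming a UTF-8 database):

select length(E'a\u0328') as characters, octet_length(E'a\u0328') as bytes;
 characters | bytes
------------+-------
          2 |     3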

The fuzzystrmatch documentation (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) warns that:

At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).

But since it does not name the levenshtein function, I was wondering whether there is a multibyte-aware version of levenshtein.

I know that I could use the unaccent function as a workaround, but I need to keep the diacritics.
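For reference, this is what unaccent does with the precomposed character; the diacritic is simply dropped, which is why it doesn't work for me (assuming the unaccent extension is installed):

create extension unaccent;

select unaccent(E'\u0105');
 unaccent
----------
 a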

  • Out of curiosity what is your server, database, table and column settings? And what is the Unicode character exactly that you're trying to translate. – JayRizzo Jun 19 '19 at 22:42
  • Use [UTF-8 codes](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&unicodeinhtml=hex) (U+0105 for ą) instead of [combining characters](https://en.wikipedia.org/wiki/Combining_character). – klin Jun 19 '19 at 23:52
  • @JayRizzo: Server and database encoding is **UTF8**. `\d TABLE` doesn't show the encoding of the table and columns, so I strongly assume it uses the setting for the DB. – Johann Gottfried Jun 20 '19 at 09:42
  • @JayRizzo: About the character: See my new question https://stackoverflow.com/q/56683034/11637535. – Johann Gottfried Jun 20 '19 at 09:48
  • @klin: How can I do that? Please see my new question https://stackoverflow.com/q/56683034/11637535 where I elaborate on what I am trying to achieve. – Johann Gottfried Jun 20 '19 at 09:49

1 Answer

2

Note: This solution was suggested by @Nick Barnes in his answer to a related question.

The 'a' with a diacritic is a character sequence, i.e. a combination of the letter a and a combining character, the ogonek ̨ (U+0328): E'a\u0328'

There is an equivalent precomposed character ą: E'\u0105'

A solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence into the precomposed character before comparing them.

Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.

For example, with PL/Python:

create extension plpythonu;

create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  # plpythonu runs Python 2: the text argument arrives as a byte string,
  # so decode it to unicode before normalizing to NFC
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;
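Note that plpythonu is Python 2 (hence the decode step) and is no longer shipped with current Postgres releases. A sketch of the same function on plpython3u, where text arguments already arrive as Python 3 str values:

create extension plpython3u;

create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  # Python 3: text values are passed as str, no decoding needed
  return unicodedata.normalize('NFC', str)
$$ language plpython3u;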

Now that unicode_normalize maps the character sequence E'a\u0328' onto the equivalent precomposed character E'\u0105', the Levenshtein distance is correct:

select levenshtein(unicode_normalize(E'a\u0328'), 'x');
 levenshtein
-------------
           1
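The PL/Perl route mentioned above works the same way. A minimal sketch, assuming the untrusted plperlu language (trusted plperl does not allow loading modules) and the core Unicode::Normalize module; the function name here is just for illustration:

create extension plperlu;

create or replace function unicode_normalize_perl(str text) returns text as $$
  use Unicode::Normalize;
  # NFC composes combining sequences into precomposed characters
  return NFC(shift);
$$ language plperlu;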
  • [Postgres now includes a `normalize` function (as of version 13)](https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.1). – robjwells Mar 07 '21 at 12:40
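On Postgres 13 or later, that built-in normalize function makes the PL workaround unnecessary; the example above reduces to something like this (the second argument is one of the keywords NFC, NFD, NFKC, NFKD):

select levenshtein(normalize(E'a\u0328', NFC), 'x');
 levenshtein
-------------
           1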