I have Strings containing two similar looking characters. Both appear as small 'a's with an ogonek:
ą
ą
(Note: depending on the renderer they are sometimes rendered similarily, sometimes slightly differently)
However, they are different:
Characteristics of the 1st character:
In PostgreSQL:
select ascii('ą');
ascii
-------
261
The UTF-8-encoding in Hex is: \xC4\x85
so it is a precomposed character (https://en.wikipedia.org/wiki/Precomposed_character)
Characteristics of the 2nd character:
In PostgreSQL:
select ascii('ą');
ascii
-------
97
(same as character 'a')
That strongly indicates that the rendered character is combined out of two characters. And it is indeed:
The UTF-8-encoding in Hex is: \x61\xCC\xA8
So it is a combination of
a \x61\
and a combining character (https://en.wikipedia.org/wiki/Combining_character), the separate ogonek:
̨ \xCC\xA8
I want to use PostgreSQL's levenshtein function to determine the similarity of words, and so I want treat both characters as the same (as it is of course intended by people who write the name of a distinctive entity either with the 1st or the 2nd character).
I assumed that I can use unaccent to always get rid of the ogonek, but that is not working in the 2nd case:
1st character: expected result:
select levenshtein('ą', 'x');
levenshtein
-------------
1
1st character: expected result:
select levenshtein(unaccent('ą'), 'x');
levenshtein
-------------
1
2nd character: expected result:
select levenshtein('ą', 'x');
levenshtein
-------------
2
2nd character: unexpected result:
select levenshtein(unaccent('ą'), 'x');
levenshtein
-------------
2
So, when I compare both characters with levenshtein and unaccent, the result is 1:
select levenshtein(unaccent('ą'), unaccent('ą'));
levenshtein
-------------
1
instead of 0.
How can I "get rid of the ogonek" in the 2nd case?
(How) can I use the UTF-8 codes of Strings to get the achieved result?
Edit: As @s-man suggested, adding the combining character to unaccent.rules
would solve this particular problem. But to generally solve the precomposed character vs. combined character problem with unaccent, I would have to explicitly add/modify every missing/"misconfigured" combined character to/in the config.