In a longer string, how to convert a char (e.g., 'a') and a diacritic char (e.g., 'U+030A') into a single char ('å') in PostgreSQL (and/or Python 3)

Question

I was using the initcap() function in Postgres when I noticed that for some names, like gunnar pålsen, it would output Gunnar PåLsen (with a capital L), while at other times it handled the same name correctly. While investigating, I noticed that in the failing cases the å was not a single char (U+00E5) but two chars: a (U+0061) followed by the combining diacritic U+030A (ref).
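To check which form a given string uses, a minimal Python 3 sketch like the following lists the code points one by one. The helper name `describe` is made up for this illustration; only the standard-library `unicodedata` module is assumed.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    # `describe` is a hypothetical helper, not part of any library.
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

decomposed = "a\u030a"   # 'a' followed by the combining ring above (two code points)
precomposed = "\u00e5"   # the single-code-point form of the same letter

print(len(decomposed), describe(decomposed))
print(len(precomposed), describe(precomposed))
```

Both strings render as the same letter, but the first has length 2 and the second has length 1.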

Is there a simple/elegant way to convert two chars like this into a single char? It should work for other letters too, like é and ö.

My current solution looks like this:

select
    initcap(
        replace(
            replace(
                replace(
                    replace(
                        replace(
                            replace(
                                replace(
                                    lower(person.full_name),
                                    'á', 'á'
                                ),
                                'ö', 'ö'
                            ),
                            'å', 'å'
                        ),
                        'é', 'é'
                    ),
                    'è', 'è'
                ),
                'ä', 'ä'
            ),
            'ü', 'ü'
        )
    ) as full_name
from person

This isn't very robust, since new combinations might show up. (Note that copy-pasting seems to have converted the chars here, so assume that the first argument of each replace() is the two-char form, a base letter plus a diacritic.)

I also tried to use translate(), but I don't think it will work because it seems to require translation between single chars, while in this case we go from two to one.

A solution for Python is also acceptable. Part of the dataflow goes via Python, so I can fix it there if needed, but a Postgres solution is preferable.

Unicode calls this [Unicode normalization](http://www.unicode.org/reports/tr15/). — Boris Verkhovskiy, Jul 30 '21 at 20:53
Thanks it seems to be this in postgres: SELECT NORMALIZE('Pål'); — André C. Andersen, Jul 30 '21 at 20:56
@Boris Yes it does. Seems like I simply didn't know the right word to search for. Not sure if we maybe should just close this question. If you would like to answer I can accept it as the correct answer. — André C. Andersen, Jul 30 '21 at 21:00
I flagged it as a duplicate, isn't there a way for you to accept the flag? Or maybe you can click the grey "Close" under the question and flag it as a duplicate too. — Boris Verkhovskiy, Jul 30 '21 at 21:03
@Boris yes I found the button. It should now be marked as a duplicate. Thanks again. — André C. Andersen, Jul 30 '21 at 21:05

score 1 · Answer 1 · answered Jul 30 '21 at 21:04

Based on the comment of @Boris, the following works in postgres:

select initcap(normalize('gunnar pålsen')) as full_name

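This outputs the expected Gunnar Pålsen. normalize() (available since PostgreSQL 13) defaults to the NFC form, which combines a base letter and its combining diacritic into a single code point where one exists.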
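Since the question also allows a Python 3 solution, here is a minimal sketch of the same idea using the standard-library `unicodedata.normalize`. The function name `normalize_name` is hypothetical, and `str.title()` is only a rough stand-in for initcap(); the two may differ on some inputs.

```python
import unicodedata

def normalize_name(raw):
    """Compose base letters with their combining marks (NFC), then title-case."""
    # `normalize_name` is a made-up helper for this example.
    composed = unicodedata.normalize("NFC", raw)
    return composed.title()

raw = "gunnar pa\u030alsen"  # 'a' + combining ring above, i.e. the two-char form
result = normalize_name(raw)
print(result)  # Gunnar Pålsen
```

NFC is used here because it is the composing form; NFD would go the other way and split å into two code points.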

André C. Andersen
