I was using the initcap()
function in postgres when I noticed that for some names, like gunnar pålsen
, it would output Gunnar PåLsen
(with a capital L
), other times it would do it correctly for the same name. While investigating I notice that the å
was not in fact a single char (U+00E5
), but rather two chars: a
(U+0061
) and the diacritic U+030A
(ref).
Is there an simple/elegant way to convert from two chars like this to a single char? It should work for other versions too, like é
and ö
.
My current solution looks like this:
select
initcap(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
lower(person.full_name),
'á', 'á'
),
'ö', 'ö'
),
'å', 'å'
),
'é', 'é'
),
'è', 'è'
),
'ä', 'ä'
),
'ü', 'ü'
)
) as full_name
from person
This isn't very robust, new combinations might show up. (Notice that it seems the copy-paste here has managed to convert the chars, so assume the one on the left is two chars one where there is a diacritic.)
I also tried to use translate()
, but I don't think it will work because it seems to require translation between single chars, while in this case we go from two to one.
A solution for python is also acceptable. Part of the dataflow goes via python so I can fix it there if needed. A postgres solution is preferable.