Regexp_replace collides with German umlaut ü, ö, ä

Question

I am writing a dbt macro in SQL to clean names. I wanted an elegant way to capitalize the first letter of each name, but my expression

regexp_replace(name_col, '(\w)(\w*)', x -> upper(x[1]) || lower(x[2]))

breaks on the German umlauts ä, ö, ü.

For example, the last name schöneberger becomes SchöNeberger instead of Schöneberger.

Does anyone know what to write so that Schöneberger and other names with umlauts are capitalized correctly?

Is this a SQL _language_ related question? Which dbms are you using? — jarlh, Jan 25 '23 at 09:32
jarlh was asking what database is this? What you shared is very strange syntax for regexp_replace() — tconbeer, Jan 25 '23 at 17:26
You will need to tweak your regex to add unicode for the umlauts. Here's a SO question around that, hope it helps: https://stackoverflow.com/questions/22017723/regex-for-umlaut — Aleix CC, Jan 26 '23 at 10:36

tconbeer · Accepted Answer · 2023-01-26T16:40:02.153

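Athena uses Trino syntax, which uses Java regex syntax. In Java, \w matches only ASCII word characters ([a-zA-Z_0-9]) by default, so the ö ends one match and the letter after it starts a new one, which is then capitalized. Java also supports the extended character classes based on Unicode properties from Perl, including \p{L}, which is basically "any Unicode letter." So this will work for you: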

regexp_replace(name_col, '(\p{L})(\p{L}*)', x -> upper(x[1]) || lower(x[2]))

Proof: https://regex101.com/r/N84wjS/2
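
If you want to check the difference outside Athena, here is a minimal sketch in plain Java, on the assumption that java.util.regex behaves the same way as the engine behind regexp_replace for these two patterns. The class name CapitalizeNames and the helper capitalize are made up for illustration; they are not part of Trino or dbt. The helper mimics the lambda form of regexp_replace: capture group 1 is uppercased and group 2 is lowercased.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapitalizeNames {
    // Hypothetical helper: for every match of `regex`, uppercase capture
    // group 1 and lowercase capture group 2, like the SQL lambda does.
    static String capitalize(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // Matcher.replaceAll(Function) requires Java 9 or later.
        return m.replaceAll(r -> Matcher.quoteReplacement(
                r.group(1).toUpperCase(Locale.ROOT)
                        + r.group(2).toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        // "\u00f6" is the letter ö, escaped so the file encoding does not matter.
        String name = "sch\u00f6neberger";

        // ASCII-only \w: the umlaut splits the name into two matches.
        System.out.println(capitalize(name, "(\\w)(\\w*)"));

        // \p{L} covers any Unicode letter: the whole name is one match.
        System.out.println(capitalize(name, "(\\p{L})(\\p{L}*)"));
    }
}
```

The first call reproduces the SchöNeberger result from the question, and the second gives Schöneberger. Note that \p{L} matches letters only, so digits and underscores are no longer treated as part of a word, which is probably what you want for names anyway.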

edited Jan 26 '23 at 16:40

answered Jan 26 '23 at 16:31

tconbeer

