1

I am writing a macro in dbt with SQL to clean names. I elegantly wanted to upper the first letter of the names but my

regexp_replace('(\w)(\w*)', x -> upper(x[1]) || lower(x[2])

collides with the German umlauts ä, ö, ü

So for example the last name schöneberger with my regex expression from above becomes SchöNeberger and not Schöneberger.

Does someone know what to write so I can upper Schöneberger and other name with umlauts as well?

DataVE
  • 59
  • 8

1 Answers1

1

Athena uses Trino syntax, which uses Java regex syntax. Java supports the extended character classes using Unicode properties from Perl, including \p{L}, which is basically "any Unicode letter." So this will work for you:

regexp_replace(name_col, '(\p{L})(\p{L}*)', x -> upper(x[1]) || lower(x[2]))

Proof: https://regex101.com/r/N84wjS/2

tconbeer
  • 4,570
  • 1
  • 9
  • 21