Removing accents from string in Snowflake

Question

I am trying to remove all accents from a string column in Snowflake. In MSSQL I can use the collate function to do this. For example I have this string in a column: 'JESÚSSMITH12345'. I want to remove the accent from above the U:

In MSSQL I can do the following:

select ('JESÚSSMITH12345' Collate SQL_Latin1_General_CP1253_CI_AI) as identifier

returns: JESUSSMITH12345

However, I can't find a way to do this in Snowflake, other than using the TRANSLATE function. Using the TRANSLATE function isn't a good solution because, I would have to define every possible letter with an accent to do the translation. For example, this works in Snowflake:

select translate('JESÚSSMITH12345', 'Ú', 'U');

returns: JESUSSMITH12345

However, if a column contains any value other than U with an accent, I have to add that to my translation. It's not ideal.

Using the collate function in Snowflake, even with "ai" (a.k.a accent-insensitive), this still returns the string with the accented U.

For example in Snowflake:

select collate('JESÚSSMITH12345', 'en-ai');

returns: JESÚSSMITH12345

Felipe Hoffa · Answer 1 · 2021-09-02T17:12:32.603

3

Update: Better JS line using {Diacritic} by Greg Pavlik

return str.normalize("NFD").replace(/\p{Diacritic}/gu, "");

You can solve this with a JS UDF in Snowflake:

CREATE OR REPLACE FUNCTION normalize_js(S string)
  RETURNS string
  LANGUAGE JAVASCRIPT
  AS 'return S.normalize("NFD").replace(/[\u0300-\u036f]/g, "");'
;

select normalize_js('áéÉña');

-- 'aeEna'

I got that JS code from Remove accents/diacritics in a string in JavaScript.

edited Sep 02 '21 at 17:12

answered Mar 12 '21 at 20:41

Felipe Hoffa

54,922
16
151
325

I just realized there's a newer question that is a duplicate of this one. It's here: https://stackoverflow.com/questions/69031822/remove-diacritics-from-string-in-snowflake. I found and used the same JS as Felipe, and it appears there's been an update for later versions of JavaScript, which Snowflake supports. The .replace can now use `(/\p{Diacritic}/gu, "")` to cover the entire range. The original expression missed a few. – Greg Pavlik Sep 02 '21 at 15:56

score 1 · Answer 2 · answered Mar 12 '21 at 20:27

the problem that you are facing is that U and Ú are completely different characters so from a coding/SQL perspective the statement "remove the accent from above the U" is meaningless.

The only option you have is to use, as you have done, the TRANSLATE function (or build equivalent functionality in a Stored Procedure). You will need to take every character that has an ascii value >= 128 and provide the character you want to replace it with. You could hard code this or, for more flexibility, put the accented chars, and the char you want to translate them to, into a table and use this in your SP

Removing accents from string in Snowflake

2 Answers2

Linked

Related