2

I have a varchar column and i want to replace all diacritics with normal letters

For example:

  • In: São Paulo Out: Sao Paulo
  • In: eéíãç Out: eeiac
luisvenezian
  • 441
  • 5
  • 18
  • Updated my answer with encoding you may want to use. Please consider adding additional context/details to your question for others that may be encountering similar issues. – pygeek Oct 08 '20 at 20:48
  • If it's actually *all* diacritics (https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode) it's really hard, you might create a huge list and `otranslate` to the base character. If you hide it in a SQL UDF it would be easy to use. But the best solution is probably based on a C-UDF, search for existing C-code doing this and then you just have to wrap it in a TD C-UDF – dnoeth Oct 08 '20 at 21:49
  • appreciate all your thoughts and so i'll keep trying it by using Translate, SQL UDF or C-UDF – luisvenezian Oct 09 '20 at 02:37
  • What's the character type of the base column? LATIN or UNICODE? – dnoeth Oct 09 '20 at 08:22
  • By default teradata has been setting LATIN when i create varchar columns – luisvenezian Oct 09 '20 at 11:37

1 Answers1

3

A diacritical character is a composite character, i.e. can be a base char plus a diacritic, e.g.

   'a' +  '´' =  'á'
  0061 + 00B4 = 00E1   
    

Both 006100B4 and 00E1 result in the same character, Unicode allows to switch back and forth using normalization functions, which are supported by Teradata:

translate(string using UNICODE_TO_UNICODE_NFD)

decomposes a composite character into separate characters. Those Combining Diacritical Marks are in a Unicode block ranging from U+0300 to U+036F.

Now decompose the input and apply a Regular Expression to remove characters from this range:

select
   'Ťĥïŝ ẅåš á şťŕĭńġ ŵïţĥ ḿäṇȳ ḓìāčṝĩţïĉș' as str,
   REGEXP_REPLACE(translate(str using UNICODE_TO_UNICODE_NFD) -- decomposed input
                 ,'[\x{0300}-\x{036F}]'                        -- diacritics
                 ,'')

returns

Ťĥïŝ ẅåš á şťŕĭńġ ŵïţĥ ḿäṇȳ ḓìāčṝĩţïĉș
This was a string with many diacritics

If there are other decomposable characters you might need to compose them again to save space using another translate(... using UNICODE_TO_UNICODE_NFC)

If you input string has a LATIN charset it might be easier to find the limited list of diacritical characters and apply translate:

oTranslate(str, 'àáâãäåÀÁÂÃÄÅÇ...', 'aaaaaaAAAAAAC...')
dnoeth
  • 59,503
  • 4
  • 39
  • 56
  • I'll check your answer because that code with regex solved my examples but when i tried with your example it returns `This ?a a string with ?a?? ?iac?itic?`, do you know what could be happening? – luisvenezian Oct 09 '20 at 13:45
  • Does the original or the replaced string look like this? Check your session character set and your client font. – dnoeth Oct 09 '20 at 14:26
  • the replaced, i was using ASCII as connection session character, also i tried UTF8 but had error when insert: `The string contains an untranslatable character`. I can't use LATIN1250_1A0 because it raises: `databases does not support...`. Should i start a new question? – luisvenezian Oct 09 '20 at 15:04
  • What's your client? Of course, ASCII doesn't support some of those characters, but UTF-16 or UTF-8 should work. The *untranslatable* error is strange, you wrote *when insert*: did you try to insert it into a LATIN column? – dnoeth Oct 09 '20 at 20:31
  • Caution, I had a typo (missing x) in my Unicode range, which will remove additinal characters, fixed. – dnoeth Oct 10 '20 at 09:54