Replacing Unicode characters using TRANSLATE function

Question

A customer asked to create a custom character mapper function from specific names to ASCII in their SQL database.

Here is a simplified fragment that works (shortened for brevity):

select TRANSLATE(N'àáâãäåāąæậạả', 
                 N'àáâãäåāąæậạả', 
                 N'aaaaaaaaaaaa');

While analyzing the results on customer's dataset, I noticed one more unmapped symbol ă. So I added it to the mapper as follows:

select TRANSLATE(N'àáâãäåāąæậạảă', 
                 N'àáâãäåāąæậạảă', 
                 N'aaaaaaaaaaaaa');

Unexpectedly, it started failing with the message:

The second and third arguments of the TRANSLATE built-in function must contain an equal number of characters.

Obviously, TRANSLATE thinks that ă is special and consists of more than one character. Actually, even Notepad thinks the same (copy ă and try to delete it using Backspace key - something unusual will happen. Delete key works normally, though).

Then I thought - if TRANSLATE considers it a two-char symbol, let's add a two char mapping then:

select TRANSLATE(N'àáâãäåāąæậạảă', 
                 N'àáâãäåāąæậạảă', 
                 N'aaaaaaaaaaaaaa');

No errors this time, yay. But the input string was not processed correctly, ă was not replaced with a.

What is the correct (case-sensitive) way to replace such "double symbols"? Can it be done using TRANSLATE at all? I don't want to add a bunch of REPLACE for every such symbol I find.

What do you mean by Unicode characters? All the characters in this page are Unicode. The same with all characters in this comment. The very requirement is extremely suspicious and needs to be explained and justified. It's quite likely there's no need for such "mapping" and the database collation should be fixed instead — Panagiotis Kanavos, Jun 03 '21 at 13:32
As for "double symbols" - you're out of luck. Some characters can be generated by combining diacritics, others are old enough that they have their own Unicode character. A glyph may be a single character or a combination of two *or more* .SQL, the language wasn't built for text manipulation and *definitely* not for Unicode transformations. — Panagiotis Kanavos, Jun 03 '21 at 13:35
@PanagiotisKanavos Although many symbols might be Unicode (but some might belong to extended Latin charset), still only a few create such problems. For example - ă and æ̆ breaks TRANSLATE, but ĕ does not. I tried multiple collations, but none of them gave the exact results the customer needed, that's why I'm trying to map them manually. — JustAMartin, Jun 03 '21 at 13:38
What does the customer need? That's an arbitrary conversion. This isn't the 1970s. This conversion would break a lot of English and American names. Jayne Eyre was written by Charlotte Brontë. If the customer wants to index and match in a specific way, they need to set the collation properly, not mangle the data — Panagiotis Kanavos, Jun 03 '21 at 13:41
@Larnu Maybe you copied the first code fragment. The problematic is the second one. — JustAMartin, Jun 03 '21 at 13:43
@PanagiotisKanavos ... says someone whose SO username is Latinized Greek :-) — Charlieface, Jun 03 '21 at 13:43
And once again, SQL can't do this. It simply doesn't know about Unicode. If you really want to replace such characters you'd have to use a SQLCLR function and make the replacement in C#, using methods like [String.Normalize](https://learn.microsoft.com/en-us/dotnet/api/system.string.normalize?view=net-5.0#System_String_Normalize) to normalize multiple forms into one, and [Char methods](https://learn.microsoft.com/en-us/dotnet/api/system.char.issurrogate?view=net-5.0) like IsSurrogate, IsPunctuation etc — Panagiotis Kanavos, Jun 03 '21 at 13:44
@Charlieface which is why I know such conversions are pointless, because outside the US and UK people have been using Unicode for the last 20-25 years. It's funny seeing all the Unicode-related questions by Python or R devs that think Unicode is an escape sequence, and frustrating trying to explain it to them. — Panagiotis Kanavos, Jun 03 '21 at 13:46
@PanagiotisKanavos - yep, the customer wants to map to ASCII by their own rules. Most likely for printing on some systems that do not support entering such symbols. Also, to make life easier for system operators when searching. BTW, Russian -> ASCII transliteration is also often requested function where I live. So, sometimes ASCII still rules the world. — JustAMartin, Jun 03 '21 at 13:47
@JustAMartin What you say typically happens with ... unfortunate ... software in regulated or closed markets (eg tax software). This doesn't change the fact you *can't* do what you want with TRANSLATE. You need [String.Nomalize](https://learn.microsoft.com/en-us/dotnet/api/system.string.normalize?view=net-5.0#System_String_Normalize) to replace multiple representations with one. In the doc example `ắ` can use 1,2 or 3 code points. — Panagiotis Kanavos, Jun 03 '21 at 13:56
PS I didn't make up the `ắ` example. But it does show that TRANSLATE is impractical - just how many `a`s are there? There are questions asking how to convert such characters and one trick is to convert to the form using all diacritics then *strip the diacritics*. — Panagiotis Kanavos, Jun 03 '21 at 13:58
[How do I remove diacritics (accents) from a string in .NET?](https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net) should help. Using the method in the accepted answer produces `aaaaaaaaæaaaaa`. The `æ` isn't converted because it's *not an accented `a`. It's A and E. As in [Æthelstan](https://en.wikipedia.org/wiki/%C3%86thelstan), an English king. That name is also spelled [`Athelstan`, `Ethelstan` and `Aethelstan`](https://www.britannica.com/biography/Athelstan) — Panagiotis Kanavos, Jun 03 '21 at 14:02
@PanagiotisKanavos Thanks, I guess I'll have to use some combined solution to remove accents and also do replacements. — JustAMartin, Jun 03 '21 at 14:09
And the [quick&dirty, YOLO method](https://stackoverflow.com/a/2086575/134204) - convert to Hebrew, or any *non-Latin* codepage, then back. Most, but not all characters will be translated to their closest equivalent in the target. The equivalent `select convert(varchar(50), N'àáâãäåāąæậạả') collate Cyrillic_General_CI_AI` produces `aaaaaaaa????`. — Panagiotis Kanavos, Jun 03 '21 at 14:26

Replacing Unicode characters using TRANSLATE function

0 Answers0