0

Is there a way to perform data cleaning and identify LATIN characters such as Á (and similar) and convert them to the nearest US English chars (in this case, 'A')?

I prefer to do this directly in TSQL code, but if its not possible, I am willing to employ other options such as any C# method.

https://en.wikipedia.org/wiki/List_of_Unicode_characters

I have hundreds of thousands records in a table. that data entries have such mistakes, these are causing problems which breaks the code..

Thanks.

David
  • 239
  • 3
  • 10
  • 2
    https://stackoverflow.com/a/12715102/1080354 You can simply convert the data in the T-SQL :-) – gotqn Feb 11 '20 at 05:56

2 Answers2

1

Updated Anwer (20120212):

Though my answer below will work, and considering that T-SQL TRANSLATE is especially fast, what gotn posted and David's answer above will perform much better.

It's worth noting (I just learned this) that both Cyrillic_General_CI_AI and Greek_CI_AI also work. This is good to know because some collations perform better than others so it's good to test.


Previous Answer

You didn't mention what version of SQL Server you are using.

If you are using SQL Server 2017+

Then you can leverage TRANSLATE like so:

DECLARE @string VARCHAR(100) = 'Thë Quìck Grééñ Fox Jumps ôver The Lázy Dög.';

SELECT      NewString = TRANSLATE(@string,accents.C,clean.C)
FROM        (VALUES('áÁâÂäÄãÃàÀéÉêÊëËèÈíÍîÎïÏìÌóÓôÔöÖõÕòÒúÚûÛüÜùÙýÝñÑçÇ')) AS accents(C)
CROSS APPLY (VALUES('aAaAaAaAaAeEeEeEeEiIiIiIiIoOoOoOoOoOuUuUuUuUyYnNcC')) AS clean(C);

Returns: The Quick Green Fox Jumps over The Lazy Dog.

If you are using an earlier version of SQL

For this you can leverage NGrams8K to create your own TRANSLATE function which I include at the end of this post.

SELECT      t.NewString
FROM        (VALUES('áÁâÂäÄãÃàÀéÉêÊëËèÈíÍîÎïÏìÌóÓôÔöÖõÕòÒúÚûÛüÜùÙýÝñÑçÇ')) AS accents(C)
CROSS APPLY (VALUES('aAaAaAaAaAeEeEeEeEiIiIiIiIoOoOoOoOoOuUuUuUuUyYnNcC')) AS clean(C)
CROSS APPLY samd.translate8K(@string,accents.C,clean.C)     AS t;

Here is the code I used to generate the string of accent characters:

SELECT CONCAT(
  CHAR(225),CHAR(193),CHAR(226),CHAR(194),CHAR(228),CHAR(196),CHAR(227),CHAR(195),CHAR(224),CHAR(192), -- a(10)
  CHAR(233),CHAR(201),CHAR(234),CHAR(202),CHAR(235),CHAR(203),CHAR(232),CHAR(200),                     -- e(8)
  CHAR(237),CHAR(205),CHAR(238),CHAR(206),CHAR(239),CHAR(207),CHAR(236),CHAR(204),                     -- i(8)
  CHAR(243),CHAR(211),CHAR(244),CHAR(212),CHAR(246),CHAR(214),CHAR(245),CHAR(213),CHAR(242),CHAR(210), -- o(10)
  CHAR(250),CHAR(218),CHAR(251),CHAR(219),CHAR(252),CHAR(220),CHAR(249),CHAR(217),                     -- u(8)
  CHAR(253),CHAR(221),CHAR(241),CHAR(209),CHAR(231),CHAR(199))                               -- y(2),n(2),c(2)

Custom Translate Function:

CREATE FUNCTION samd.translate8K
(
  @inputString  VARCHAR(8000), 
  @characters   VARCHAR(8000), 
  @translations VARCHAR(8000)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
/*****************************************************************************************
[Purpose]: Custom version of SQL Server 2017 Translate

-- Masking possible PII
DECLARE @inputstring VARCHAR(8000) = 
  'I don''t know if you got my SSN but it''s 555-90-5511. Call me at 312.800.5555';

SELECT t.newstring
FROM   samd.translate8K(@inputString, '0123456789', REPLICATE('#',10)) AS t;

[Revision History]:
------------------------------------------------------------------------------------------
 Rev 00 - 20180725 - Initial Development - Alan Burstein
****************************************************************************************/
SELECT newstring = 
(
  SELECT      CASE WHEN c.chr>c.tx THEN '' WHEN c.chr>0 THEN t.chr ELSE ng.token END+''
  FROM        samd.NGrams8k(@inputString,1)                                 AS ng 
  CROSS APPLY (VALUES(CHARINDEX(ng.token, @characters),LEN(@translations))) AS c(chr,tx)
  CROSS APPLY (VALUES(SUBSTRING(@translations,c.chr,1)))                    AS t(chr)
  ORDER BY ng.position
  FOR XML PATH(''), TYPE
).value('text()[1]', 'varchar(8000)');
GO

https://emw3.com/unicode-accents.html

Alan Burstein
  • 7,770
  • 1
  • 15
  • 18
1

Reference: (Thanks to @gotqn in the comments) stackoverflow.com/a/12715102/1080354

SELECT 'héllö! this is à test!' Collate SQL_Latin1_General_CP1253_CI_AI

result:

hello! this is a test!
David
  • 239
  • 3
  • 10