-3

I have imported the data into my DB but found the special characters in my company names.

The Special characters are as below

special character

Alecta Pensionsförsäkring, ömsesidigt

Normal character

Alecta Pensionsforsakring,Omsesidigt

special character

Aelefónica, S.A.

Normal character

Aelefonica,S.A.

I have tried with replace method with hardcoded special characters but in the real data...contains 1000's of records and it's very difficult for me to keep the hardcoded values... Kindly provide me a SQL dynamic query which will find out the special characters in a string.

jarlh
  • 42,561
  • 8
  • 45
  • 63
venugopal
  • 79
  • 10
  • Column data type? Typical encoding issue. – jarlh Jan 13 '22 at 14:11
  • 2
    Maybe you need to re-import the data with the right encoding / codepage / collation at all steps. – Aaron Bertrand Jan 13 '22 at 14:12
  • How does the file contents look? File encoding? – jarlh Jan 13 '22 at 14:14
  • In source we can't modify the script and the packages are automated. The only approach we have is from the data base level by using the SQL scripts need to modify the company names – venugopal Jan 13 '22 at 14:15
  • Even if you can't do anything about how it's imported *O_o* it might be still usefull to know what the original encoding was. Btw, could it be that the column is a VARCHAR (instead of NVARCHAR) – LukStorms Jan 13 '22 at 14:47
  • Looks like utf vs ascii issues, I guess the correct spelling is `pensionsförsäkring` etc. – James Z Jan 13 '22 at 15:19

1 Answers1

0

The original encoding is probably UTF-8.

So you can use a UDF to make it more readable.

The UDF is based on this old answer from Xabi

CREATE FUNCTION Utf8ToUcs(@src varchar(MAX)) 
RETURNS NVARCHAR(MAX) 
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @i INT
          , @res NVARCHAR(MAX)=@src
          , @pi VARCHAR(18);
          
    SELECT 
      @pi = '%[à-ï][€-¿][€-¿]%'
    , @i = PATINDEX(@pi, @src COLLATE Latin1_General_BIN);
 
    WHILE @i > 0 
      SELECT 
        @res = STUFF(@res,@i,3, NCHAR(
                      ((ASCII(SUBSTRING(@src,@i,1))&31)*4096)
                     +((ASCII(SUBSTRING(@src,@i+1,1))&63)*64)
                     +(ASCII(SUBSTRING(@src,@i+2,1))&63)))
      , @src = STUFF(@src,@i,3,'.')
      , @i = PATINDEX(@pi,@src COLLATE Latin1_General_BIN);

    SELECT 
      @pi = '%[Â-ß][€-¿]%'
    , @i = PATINDEX(@pi,@src COLLATE Latin1_General_BIN);

    WHILE @i > 0 
      SELECT 
        @res = STUFF(@res,@i,2,NCHAR(
                       ((ASCII(SUBSTRING(@src,@i,1))&31)*64)
                       +(ASCII(SUBSTRING(@src,@i+1,1))&63)))
      , @src = STUFF(@src,@i,2,'.')
      , @i = PATINDEX(@pi,@src COLLATE Latin1_General_BIN);

    RETURN @res;
END;
select col
, dbo.Utf8ToUcs(col) as col_ucs
from test
col col_ucs
Alecta Pensionsförsäkring, ömsesidigt Alecta Pensionsförsäkring, ömsesidigt
Aelefónica, S.A. Aelefónica, S.A.
Áæâàéêèeeeëöu Áæâàéêèeeeëöu

And then you can also remove the accents.
Here's it's done via a collation. (solution taken from here)

select col
, cast(dbo.Utf8ToUcs(col) as varchar(100)) Collate SQL_Latin1_General_CP1253_CI_AI as col_cp1253
from test
col col_cp1253
Alecta Pensionsförsäkring, ömsesidigt Alecta Pensionsforsakring, omsesidigt
Aelefónica, S.A. Aelefonica, S.A.
Áæâàéêèeeeëöu A?aaeeeeeeeou

Demo on db<>fiddle here

LukStorms
  • 28,916
  • 5
  • 31
  • 45