1

All, I have some really dirty text being imported to SQL and I can't work out how to clean it up. After a lot of digging it seems the file contains some non ASCII characters, hex99 for example being the TM code.

My required output from this string

ENM / éææ¨FEE/\~`+=-

Would be

'ENM / FEE/\~`+=-'

I am sure there is a quick fix but I have not yet found it.

Database collation is SQL_Latin1_General_CP1_CI_AS

Here is the dirty string I have the issue with ENM/éææ¨FEE I don't know how it will render here but its basically something like ENM/金晓晨FEE This is not rendering with the correct dodgy characters

A file with the text in is here: https://textsaver.flap.tv/lists/3fjv

I've tried various ways to try and get rid of these characters

This is my existing replace function, pretty standard stuff

 CREATE FUNCTION [dbo].[ufn_CleanText]
 (
     @String NVARCHAR(MAX) 

 )
 RETURNS NVARCHAR(MAX)
 AS
 BEGIN
  DECLARE @Result nvarchar(MAX)
     SET @Result =''

     DECLARE @character nvarchar(1)
     DECLARE @index int

     SET @index = 1
     WHILE @index <= LEN(@String)
     BEGIN
         SET @character = SUBSTRING(@String, @index, 1)

         IF (UNICODE(@character) between 32 and 127) --or UNICODE(@character) in (10,11)
             SET @Result = @Result + @character
         SET @index = @index + 1
     END

     RETURN @Result

 END;
GO

Selects

select dbo.ufn_CleanText( cast('ENM/éææ¨FEE' COLLATE Latin1_General_100_BIN2 as nvarchar ))
union all
select cast('ENM/éææ¨FEE' as nvarchar)
union all
SELECT REPLACE('ENM/éææ¨FEE',char(0x99),'')

returns

ENM/?????FEE
ENM/é??æ??æ?¨FEE
ENM/é??æ??æ?¨FEE

What is frustrating is I can do this pretty easily with c# so I could do a CLR but I would rather do this all in SQL if possible? I am not sure if there is some way to use pat index like this in SQL?

e.g

string stringvar = "ENM / 金晓晨FEE";
Console.WriteLine($"This is dirty: {stringvar}");
string clean = Regex.Replace(stringvar, @"[^\u0000-\u007F]+", string.Empty);
Console.WriteLine($"this is clean {clean}");
Console.ReadLine();

returns

This is dirty: ENM / é??æ??æ?"FEE
this is clean ENM / FEE

Edit1:

Ok so I found a patindex function: How to strip all non-alphabetic characters from string in SQL Server?

CREATE FUNCTION [dbo].[fn_StripCharacters]
(
    @String NVARCHAR(MAX), 
    @MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SET @MatchExpression =  '%['+@MatchExpression+']%'

    WHILE PatIndex(@MatchExpression, @String) > 0
        SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

    RETURN @String

END

and if I call it like this:

select dbo.fn_StripCharacters('ENM/éææ¨FEE/\~+=-' COLLATE Latin1_General_BIN, '^0-9a-z/+-*/=')` Then I get back something which is closer

ENM/éææFEE/=

But it still contains characters that I do not want e.g é and excludes some that I do but its a step in the right direction

Edit2:

here is the hexdump:

Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 45 4E 4D 2F C3 A9 C2 87 C2 91 C3 A6 C2 99 C2 93
00000010: C3 A6 C2 99 C2 A8 46 45 45

Adam Davies
  • 145
  • 1
  • 8

1 Answers1

1

It's a common mistake to add unicode to a non-unicode literal. This expression

'ENM/金晓晨FEE'

converts the unicode string to a varchar, corrupting it.

The equivilent of C#'s

string stringvar = "ENM / 金晓晨FEE";

is

declare @stringvar nvarchar(max) = N'ENM/金晓晨FEE';
David Browne - Microsoft
  • 80,331
  • 6
  • 39
  • 67
  • Hmm that makes sense, in reality this is coming from a column in the database so it may already be varchar rather than NVARCHAR. I think the problem is as well that the text able is not displaying as it actually it. I.e. what I have put above is not being rendered as a try representation of the text I have to clean – Adam Davies Jun 09 '20 at 16:35
  • Cast to varbinary to examine the ground truth. – David Browne - Microsoft Jun 09 '20 at 16:44
  • I've added a link to the OP which contains the original string – Adam Davies Jun 09 '20 at 16:54
  • You can change fn_StripCharacters to operate on varbinary and strip or replace the unwanted code points. – David Browne - Microsoft Jun 09 '20 at 16:56
  • 1
    I think you've hit the issue, the problem is the way the data is getting into the table, I'll need to go back to the source system and see if we can get it in correctly as NVARCHAR – Adam Davies Jun 09 '20 at 17:10