All, I have some really dirty text being imported to SQL and I can't work out how to clean it up. After a lot of digging it seems the file contains some non ASCII characters, hex99 for example being the TM code.
My required output from this string
ENM / éææ¨FEE/\~`+=-
Would be
'ENM / FEE/\~`+=-'
I am sure there is a quick fix but I have not yet found it.
Database collation is SQL_Latin1_General_CP1_CI_AS
Here is the dirty string I have the issue with ENM/éææ¨FEE
I don't know how it will render here but its basically something like ENM/金晓晨FEE
This is not rendering with the correct dodgy characters
A file with the text in is here: https://textsaver.flap.tv/lists/3fjv
I've tried various ways to try and get rid of these characters
This is my existing replace function, pretty standard stuff
CREATE FUNCTION [dbo].[ufn_CleanText]
(
@String NVARCHAR(MAX)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
DECLARE @Result nvarchar(MAX)
SET @Result =''
DECLARE @character nvarchar(1)
DECLARE @index int
SET @index = 1
WHILE @index <= LEN(@String)
BEGIN
SET @character = SUBSTRING(@String, @index, 1)
IF (UNICODE(@character) between 32 and 127) --or UNICODE(@character) in (10,11)
SET @Result = @Result + @character
SET @index = @index + 1
END
RETURN @Result
END;
GO
Selects
select dbo.ufn_CleanText( cast('ENM/éææ¨FEE' COLLATE Latin1_General_100_BIN2 as nvarchar ))
union all
select cast('ENM/éææ¨FEE' as nvarchar)
union all
SELECT REPLACE('ENM/éææ¨FEE',char(0x99),'')
returns
ENM/?????FEE
ENM/é??æ??æ?¨FEE
ENM/é??æ??æ?¨FEE
What is frustrating is I can do this pretty easily with c# so I could do a CLR but I would rather do this all in SQL if possible? I am not sure if there is some way to use pat index like this in SQL?
e.g
string stringvar = "ENM / 金晓晨FEE";
Console.WriteLine($"This is dirty: {stringvar}");
string clean = Regex.Replace(stringvar, @"[^\u0000-\u007F]+", string.Empty);
Console.WriteLine($"this is clean {clean}");
Console.ReadLine();
returns
This is dirty: ENM / é??æ??æ?"FEE
this is clean ENM / FEE
Edit1:
Ok so I found a patindex function: How to strip all non-alphabetic characters from string in SQL Server?
CREATE FUNCTION [dbo].[fn_StripCharacters]
(
@String NVARCHAR(MAX),
@MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
SET @MatchExpression = '%['+@MatchExpression+']%'
WHILE PatIndex(@MatchExpression, @String) > 0
SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')
RETURN @String
END
and if I call it like this:
select dbo.fn_StripCharacters('ENM/éææ¨FEE/\~
+=-' COLLATE Latin1_General_BIN, '^0-9a-z/+-*/=')`
Then I get back something which is closer
ENM/éææFEE/=
But it still contains characters that I do not want e.g é
and excludes some that I do but its a step in the right direction
Edit2:
here is the hexdump:
Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 45 4E 4D 2F C3 A9 C2 87 C2 91 C3 A6 C2 99 C2 93
00000010: C3 A6 C2 99 C2 A8 46 45 45