Efficient Cleaning of Strings in a Table

Question

I'm currently working on a problem where certain characters need to be cleaned from strings that exist in a table. Normally I'd do a simple UPDATE with a replace, but in this case there are 32 different characters that need to be removed.

I've done some looking around and I can't find any great solutions for quickly cleaning strings that already exist in a table.

Things I've looked into:

Doing a series of nested replaces

This solution is do-able, but for 32 different replaces it would require either some ugly code, or hacky dynamic sql to build a huge series of replaces.
PATINDEX and while loops

As seen in this answer it is possible to mimic a kind of regex replace, but I'm working with a lot of data so I'm hesitant to even trust the improved solution to run in a reasonable amount of time when the data volume is large.
Recursive CTEs

I tried a CTE approuch to this problem, but it didn't run terribly fast once the number of rows got large.

For reference:

CREATE TABLE #BadChar(
    id int IDENTITY(1,1),
    badString nvarchar(10),
    replaceString nvarchar(10)

);

INSERT INTO #BadChar(badString, replaceString) SELECT 'A', '^';
INSERT INTO #BadChar(badString, replaceString) SELECT 'B', '}';
INSERT INTO #BadChar(badString, replaceString) SELECT 's', '5';
INSERT INTO #BadChar(badString, replaceString) SELECT '-', ' ';

CREATE TABLE #CleanMe(
    clean_id int IDENTITY(1,1),
    DirtyString nvarchar(20)
);

DECLARE @i int;
SET @i = 0;
WHILE @i < 100000 BEGIN
    INSERT INTO #CleanMe(DirtyString) SELECT 'AAAAA';
    INSERT INTO #CleanMe(DirtyString) SELECT 'BBBBB';
    INSERT INTO #CleanMe(DirtyString) SELECT 'AB-String-BA';
    SET @i = @i + 1
END;


WITH FixedString (Step, String, cid) AS (
    SELECT 1 AS Step, REPLACE(DirtyString, badString, replaceString), clean_id
    FROM #BadChar, #CleanMe
    WHERE id = 1

    UNION ALL

    SELECT Step + 1, REPLACE(String, badString, replaceString), cid
    FROM FixedString AS T1
    JOIN #BadChar AS T2 ON T1.step + 1 = T2.id
    Join #CleanMe AS T3 on T1.cid = t3.clean_id

)
SELECT String FROM FixedString WHERE step = (SELECT MAX(STEP) FROM FixedString);

DROP TABLE #BadChar;
DROP TABLE #CleanMe;

Use a CLR

It seems like this is a common solution many people use, but the environment I'm in doesn't make this a very easy one to embark on.

Are there any other ways to go about this I've over looked? Or any improvements upon the methods I've already looked into for this?

What version of SQL Server? In the highly unlikely event that you are on 2017 there is also [`Translate`](https://learn.microsoft.com/en-us/sql/t-sql/functions/translate-transact-sql) though that is just syntax sugar for nested replace. — Martin Smith, Jul 24 '17 at 20:20
You could extract your data to some ETL tool and perform your operations there not to be highly dependent on database capabilities. Investigate Pentaho Data Integration which is free using Community Edition for example. — Kamil Gosciminski, Jul 24 '17 at 20:30
Nested replace is super fast. For heaven's sake don't spin up a CLR or ETL for something so simple. Do the nested replace, sure it is a little ugly syntactically but not a big deal at all. — Sean Lange, Jul 24 '17 at 20:49
If you're only doing single character replacements, and aren't using a version of SQL Server that supports TRANSLATE, this may be of interest: http://www.sqlservercentral.com/scripts/String+Function/141416/ (faster than recursive CTE, but slower than nested REPLACE) — hatchet - done with SOverflow, Jul 24 '17 at 21:36
@SeanLange I figured this may be the answer, was just hoping there might of been a nicer solution I over looked — FabAndrew, Jul 24 '17 at 21:59
Well with only 32 nested replaces writing this shouldn't take more than 20 minutes. And if you spend a little time with formatting it is very manageable...albeit a big ugly. — Sean Lange, Jul 25 '17 at 13:33

hatchet - done with SOverflow · Answer 1 · 2017-07-24T23:49:19.650

Leveraging the idea from Alan Burstein's solution, you could do something like this, if you wanted to hard code the bad/replace strings. This would work for bad/replace strings longer than a single character as well.

CREATE FUNCTION [dbo].[CleanStringV1]
(
  @String   nvarchar(4000)
)
RETURNS nvarchar(4000) WITH SCHEMABINDING AS 
BEGIN
 SELECT @string = REPLACE
  (
    @string COLLATE Latin1_General_BIN,
    badString,
    replaceString
  )
 FROM
 (VALUES
      ('A', '^')
    , ('B', '}')
    , ('s', '5')
    , ('-', ' ')
    ) t(badString, replaceString) 
 RETURN @string;
END;

Or, if you have a table containing the bad/replace strings, then

CREATE FUNCTION [dbo].[CleanStringV2]
(
  @String   nvarchar(4000)
)
RETURNS nvarchar(4000) AS 
BEGIN
 SELECT @string = REPLACE
  (
    @string COLLATE Latin1_General_BIN,
    badString,
    replaceString
  )
 FROM BadChar
 RETURN @string;
END;

These are case sensitive. You can remove the COLLATE bit if you want case insensitive. I did a few small tests, and these were not much slower than nested REPLACE. The first one with the hardcoded strings was a the faster of the two, and was nearly as fast as nested REPLACE.

Efficient Cleaning of Strings in a Table

1 Answers1

Linked