Efficiently replacing many characters from a string

Question

I would like to know the most efficient way of removing any occurrence of characters like , ; / " from a varchar column.

I have a function like this but it is incredibly slow. The table has about 20 million records.

CREATE FUNCTION [dbo].[Udf_getcleanedstring] (@s VARCHAR(255))
returns VARCHAR(255)
AS
  BEGIN
      DECLARE @o VARCHAR(255)

      SET @o = Replace(@s, '/', '')
      SET @o = Replace(@o, '-', '')
      SET @o = Replace(@o, ';', '')
      SET @o = Replace(@o, '"', '')

      RETURN @o
  END

How is this being invoked? For every record? for all varchar columns in a table? Just a few? What population of data contains these characters? You may be better off first searching and limiting the data set to only those records needing to be updated rather than calling it for every record... Additionally are any of the fields being updated indexed? If so turn that off.. run update on records that change, then re-enable and recalculate the index once. The overhead I/O of table and index write are what's slowing you down. So if you an limit the update to 200k records instead of 20 mil... — xQbert, Dec 23 '14 at 18:28
@xQbert Thank you for your suggestions. Yes, this is for every record. I am going to try your steps. The reason I posted this question was because I was looking for an alternative if there was any. — blue piranha, Dec 23 '14 at 18:46
possible duplicate of [How to Replace Multiple Characters in SQL?](http://stackoverflow.com/questions/1580017/how-to-replace-multiple-characters-in-sql) — Dour High Arch, Dec 23 '14 at 18:54

Martin Smith · Accepted Answer · 2014-12-26T12:34:22.030

Whichever method you use it is probably worth adding a

WHERE YourCol LIKE '%[/-;"]%'

Except if you suspect that a very large proportion of rows will in fact contain at least one of the characters that need to be stripped.

As you are using this in an UPDATE statement then simply adding the WITH SCHEMABINDING attribute can massively improve things and allow the UPDATE to proceed row by row rather than needing to cache the entire operation in a spool first for Halloween Protection

enter image description here

Nested REPLACE calls in TSQL are slow anyway though as they involve multiple passes through the strings.

You could knock up a CLR function as below (if you haven't worked with these before then they are very easy to deploy from an SSDT project as long as CLR execution is permitted on the server). The UPDATE plan for this too does not contain a spool.

The Regular Expression uses (?:) to denote a non capturing group with the various characters of interest separated by the alternation character | as /|-|;|\" (the " needs to be escaped in the string literal so is preceded by a slash).

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class UserDefinedFunctions
{
    private static readonly Regex regexStrip = 
                        new Regex("(?:/|-|;|\")", RegexOptions.Compiled);

    [SqlFunction]
    public static SqlString StripChars(SqlString Input)
    {
        return Input.IsNull ?  null : regexStrip.Replace((string)Input, "");        
    }
}

itzik Paz · Answer 2 · 2014-12-25T21:52:10.700

I want to show the huge performance differences between the using with 2 types of USER DIFINED FUNCTIONS:

User TABLE function
User SCALAR function

See the test example :

use AdventureWorks2012
go

-- create table for the test
create table dbo.FindString (ColA int identity(1,1) not null primary key,ColB varchar(max) );

declare @text varchar(max) =  'A web server can handle a Hypertext Transfer Protocol request either by reading 
a file from its file ; system based on the URL <> path or by handling the request using logic that is specific 
to the type of resource. In the case that special logic is invoked the query string will be available to that logic 
for use in its processing, along with the path component of the URL.';

-- init process in loop 1,000,000 
insert into dbo.FindString(ColB)
select @text 
go 1000000

-- use one of the scalar function from the answers which post in this thread
alter function [dbo].[udf_getCleanedString]
( 
@s varchar(max)
)
returns  varchar(max)
as
begin
return replace(replace(replace(replace(@s,'/',''),'-',''),';',''),'"','')
end
go
--
-- create from the function above new function an a table function ;
create function [dbo].[utf_getCleanedString]
( 
@s varchar(255)
)
returns  table 
as return
(
select  replace(replace(replace(replace(@s,'/',''),'-',''),';',''),'"','') as String
)
go

--
-- clearing the buffer cach
DBCC DROPCLEANBUFFERS ;
go
-- update process using USER TABLE FUNCTIO
update Dest with(rowlock) set
dest.ColB  = D.String
from dbo.FindString dest
cross apply utf_getCleanedString(dest.ColB) as D
go

DBCC DROPCLEANBUFFERS ;
go
-- update process using USER SCALAR FUNCTION
update Dest with(rowlock) set
dest.ColB  =  dbo.udf_getCleanedString(dest.ColB) 
from dbo.FindString dest
go

AND these are the execution plan : As you can see the UTF is much better the USF ,they 2 doing the same thing replacing string, but one return scalar and the other return as a table

As you can see the UTF is much better the USF

Another important parameter for you to see (SET STATISTICS IO ON ;)

SET STATISTICS IO ON

The scalar UDF can be improved by adding `WITH SCHEMABINDING`. Timings I got running against your table after the REPLACE had already been done (so none of them actually ended up altering the data) were 'TVF - 29.2 seconds, UDF - 65.1 seconds, UDF with schema binding - 39.7 seconds, CLR UDF - 15.1' — Martin Smith, Dec 26 '14 at 12:20
Also adding a `WHERE dest.ColB LIKE '%[/-;"]%'` was worthwhile for all of them rather than just blindly updating all rows but more so for the TVF and CLR UDF as they got parallel plans. — Martin Smith, Dec 26 '14 at 12:25

score 0 · Answer 3 · answered Dec 23 '14 at 18:30

0

How about nesting them together in a single call:

 create function [dbo].[udf_getCleanedString]
 ( 
    @s varchar(255)
 )
 returns varchar(255)
 as
 begin
   return replace(replace(replace(replace(@s,'/',''),'-',''),';',''),'"','')
 end

Or you may want to do an UPDATE on the table itself for the first time. Scalar functions are pretty slow.

answered Dec 23 '14 at 18:30

Code Different

90,614
16
144
163

7

I don't think this version of the scalar UDF is going to be any faster than the original. – Aaron Bertrand Dec 23 '14 at 18:40
1

Why do you think this is a single call? Doesn't it have to call the `REPLACE` each time even if you nest them like this? – Michael McGriff Dec 23 '14 at 20:11

score 0 · Answer 4 · edited May 23 '17 at 12:16

Here is a similar question asked previously, I like this approach mentioned here.

How to Replace Multiple Characters in SQL?

declare @badStrings table (item varchar(50))

INSERT INTO @badStrings(item)
SELECT '>' UNION ALL
SELECT '<' UNION ALL
SELECT '(' UNION ALL
SELECT ')' UNION ALL
SELECT '!' UNION ALL
SELECT '?' UNION ALL
SELECT '@'

declare @testString varchar(100), @newString varchar(100)

set @teststring = 'Juliet ro><0zs my s0x()rz!!?!one!@!@!@!'
set @newString = @testString

SELECT @newString = Replace(@newString, item, '') FROM @badStrings

select @newString -- returns 'Juliet ro0zs my s0xrzone'

Efficiently replacing many characters from a string

4 Answers4