1

This is a question that's been answered before here: How to strip all non-alphabetic characters from string in SQL Server?

I have used it successfully for a previous project, but that was on a mere 100,000 rows. Although not super quick, it did the job.

I now need to work out how to scale this up to cope with hundreds of millions of rows, I have read about inline functions and am wondering if anyone else has managed to crack this.

As an example, It was previously taking 5 minutes to process 500,000 rows as part of a larger query. But now that the data set size has increased to 500,000,000 rows this is now taking almost 3 hours.

The code that was used before in a scalar function was this

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER Function [dbo].[RemoveNonAlphaCharacters](@Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin

    Declare @KeepValues as varchar(50)
    Set @KeepValues = '%[^a-z0-9]%'
    While PatIndex(@KeepValues, @Temp) > 0
        Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

    Return @Temp
End

Any help is much appreciated

Dale K
  • 25,246
  • 15
  • 42
  • 71
SQL_Novice
  • 61
  • 1
  • 2
  • 11
  • 2
    SQL Server isn't really made for that kind of operations. If possible, you should process this out of the SQL base and update the results. Like going for a BCP out to get your column content and the primary key column and then process it in Java/Python/C# or whatever language you use and then BCP in to update your column based on the primary key and on the result of the external process. – Kirjava Oct 15 '20 at 17:53
  • Unfortunately, I am quite restricted with what I can do on the server, the data has to remain intact as the special characters are there on purpose and add meaning. – SQL_Novice Oct 15 '20 at 18:12
  • 3
    _data has to remain intact_ But you can cleanse it at the time of insertion or modification and store that result separately. In short - an RBAR approach will never scale (which is what a scalar function containing a loop is) nor does it make sense to do this at runtime with every query. Now is a good time to have a discussion with the entire team (or at least the leaders) about the most appropriate method to achieve your goal. – SMor Oct 15 '20 at 18:51
  • 2
    And just a note - any name should clearly indicate what it does or contains. Your function will keep alphabetic **and numeric characters**. And your linked answer also contains multiple solutions - did you try the others? – SMor Oct 15 '20 at 18:56
  • Sorry there is one crucial element that I have forgotten to add. The column that I am using joins to another table which filters the column to only keep a subset of the results. So the function is used in a join so that each row is cleaned before joined, if a match is found then it is returned. – SQL_Novice Oct 15 '20 at 19:29
  • 1
    @SQL_Novice the answer is, not with SQL. You could write a SQLCLR function that cleans the data using a regex, write a C# program that uses a regex to read and clean the data, do the same with SSIS, or in SQL Server 2016 and later, use an R or Python script. Just don't use SQL. A regex would take 1/100th of the CPU cycles this loop takes – Panagiotis Kanavos Oct 15 '20 at 19:33
  • 1
    @SQL_Novice of course, the *best* place to clean up is when inserting new data. It's always faster to clean eg 1000 new rows instead of 50M rows, most of which are already cleaned – Panagiotis Kanavos Oct 15 '20 at 19:33
  • This is how i am currently using the scalar function INNER JOIN CODE_LIST C ON C.CODE = dbo.RemoveNonAlphaCharacters(CLEANED.CODE) – SQL_Novice Oct 15 '20 at 19:34

0 Answers0