97

I've got dirty data in a column with variable alpha length. I just want to strip out anything that is not 0-9.

I do not want to run a function or proc. I have a script that is similar that just grabs the numeric value after text, it looks like this:

Update TableName
set ColumntoUpdate=cast(replace(Columnofdirtydata,'Alpha #','') as int)
where Columnofdirtydata like 'Alpha #%'
And ColumntoUpdate is Null

I thought it would work pretty good until I found that some of the data fields I thought would just be in the format Alpha # 12345789 are not.

Examples of data that needs to be stripped

AB ABCDE # 123
ABCDE# 123
AB: ABC# 123

I just want the 123. It is true that all data fields do have the # prior to the number.

I tried substring and PatIndex, but I'm not quite getting the syntax correct or something. Anyone have any advice on the best way to address this?

Alexander Abakumov
  • 13,617
  • 16
  • 88
  • 129
SQL_Noob
  • 1,161
  • 1
  • 9
  • 18
  • Related: http://stackoverflow.com/questions/614423/removing-non-numeric-characters-in-t-sql – Jon Schneider May 05 '15 at 14:46
  • 1
    possible duplicate of [Fastest way to remove non-numeric characters from a VARCHAR in SQL Server](http://stackoverflow.com/questions/106206/fastest-way-to-remove-non-numeric-characters-from-a-varchar-in-sql-server) – Jon Schneider May 05 '15 at 14:53

20 Answers20

108

See this blog post on extracting numbers from strings in SQL Server. Below is a sample using a string in your example:

DECLARE @textval NVARCHAR(30)
SET @textval = 'AB ABCDE # 123'

SELECT LEFT(SUBSTRING(@textval, PATINDEX('%[0-9.-]%', @textval), 8000),
           PATINDEX('%[^0-9.-]%', SUBSTRING(@textval, PATINDEX('%[0-9.-]%', @textval), 8000) + 'X') -1)
Ken Richards
  • 2,937
  • 2
  • 20
  • 22
  • 1
    Why is the number 8000 used in the answer? – Anders Lindén Oct 29 '18 at 11:00
  • 3
    8000 is used because he is taking the first 8000 characters, the max size of a VARCHAR string. However, as the text was defined as NVARCHAR, it could have been 4000. My question is, is this really necessary? – RPh_Coder Mar 15 '19 at 16:12
  • 5
    Two comments: 1) gets my vote for being a straight-up expression rather than a procedure or function; 2) The question asked to remove *all* characters except 0-9. The answer here needs to be modified in 3 places to meet this requirement: Replace `0-9.-` with `0-9` (ie. remove `.-` in 3 places). – youcantryreachingme Feb 10 '20 at 22:05
  • 4
    This solution doesn't work if you have a mixed value of characters and numbers. In Oracle I simply used the TRANSLATE function to give me either Numbers or Alpha however with SQL Server TRANSLATE doesn't become available until 2017 which my company has not yet upgraded all instances to use. – Code Novice Jun 05 '20 at 22:23
  • 4
    This doesn't work. See this example. SET @textval = 'AB ABC+DE # 123+' – Gabe Jun 25 '20 at 17:33
  • 1
    This just finds the first numeric chunk and extracts it. If your string is `-123-456-789` it will return `123` – LCIII Apr 22 '21 at 14:46
  • What if the string starts with numbers and then characters and then again numbers? For example: '123abc$%10xyz9'. How can I get '123109' as the result with a select statement? – Nitin Deb Apr 27 '21 at 19:19
  • 1
    Generally doesn't work if there are more non-digit characters and then digits, after the first set. Example: ' (323) 515-0000' returns '323' only. – RBerman Oct 28 '21 at 20:00
57

Here is a practical solution if your server supports the TRANSLATE function (on sql server it's available on sql server 2017+ and also sql azure).

First, it replaces any non numeric characters with a @ character. Then, it removes all @ characters. You may need to add additional characters that you know may be present in the second parameter of the TRANSLATE call.

select REPLACE(TRANSLATE([Col], 'abcdefghijklmnopqrstuvwxyz+()- ,#+', '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'), '@', '')
Clement
  • 3,990
  • 4
  • 43
  • 44
  • 1
    I don't know if it's elegant (but rather practical) but it works so I take – pjaaar Apr 18 '23 at 08:11
  • It didn't for following example DECLARE @strAlphaNumeric VARCHAR(256) = 'AB ABCDE # 123 ddf@ 445 ffff ** w1s3' select REPLACE(TRANSLATE(@strAlphaNumeric, 'abcdefghijklmnopqrstuvwxyz+()- ,#+', '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'), '@', '') – LCJ Apr 25 '23 at 18:37
43

You can use stuff and patindex.

stuff(Col, 1, patindex('%[0-9]%', Col)-1, '')

SQL Fiddle

Mikael Eriksson
  • 136,425
  • 22
  • 210
  • 281
  • 14
    This will only remove the first occurance of non-numeric chars. – mmigdol Nov 07 '19 at 19:09
  • 2
    It doesn't remove any for me. – J Brun Jun 01 '20 at 23:18
  • This solution only removes the first substring of non-numeric characters. – alejandrob Apr 18 '23 at 13:42
  • @alejandrob It works just fine with the sample data provided by the OP. Do you perhaps have some other data in mind? Like for instance `ABC 123 EFG 456`? – Mikael Eriksson Apr 18 '23 at 14:04
  • OP "I just want to strip out ANYTHING that is not 0-9." your non-iterative function only works for the FIRST substring that matches your regular expression and then it leaves everything unchanged. @mikaeleriksson – alejandrob Apr 24 '23 at 01:51
33

This works well for me:

CREATE FUNCTION [dbo].[StripNonNumerics]
(
  @Temp varchar(255)
)
RETURNS varchar(255)
AS
Begin

    Declare @KeepValues as varchar(50)
    Set @KeepValues = '%[^0-9]%'
    While PatIndex(@KeepValues, @Temp) > 0
        Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')

    Return @Temp
End

Then call the function like so to see the original something next to the sanitized something:

SELECT Something, dbo.StripNonNumerics(Something) FROM TableA
BBauer42
  • 3,549
  • 10
  • 44
  • 81
  • 1
    Thank you very much! You're a life-saver. I modified your Regex to include decimal point `Set @KeepValues = '%[^0-9].%'` but otherwise the entire code works perfectly as expected. ;) – Annie Lagang Feb 03 '20 at 05:55
29

In case if there are some characters possible between digits (e.g. thousands separators), you may try following:

declare @table table (DirtyCol varchar(100))
insert into @table values
    ('AB ABCDE # 123')
    ,('ABCDE# 123')
    ,('AB: ABC# 123')
    ,('AB#')
    ,('AB # 1 000 000')
    ,('AB # 1`234`567')
    ,('AB # (9)(876)(543)')

;with tally as (select top (100) N=row_number() over (order by @@spid) from sys.all_columns),
data as (
    select DirtyCol, Col
    from @table
        cross apply (
            select (select C + ''
            from (select N, substring(DirtyCol, N, 1) C from tally where N<=datalength(DirtyCol)) [1]
            where C between '0' and '9'
            order by N
            for xml path(''))
        ) p (Col)
    where p.Col is not NULL
)
select DirtyCol, cast(Col as int) IntCol
from data

Output is:

DirtyCol              IntCol
--------------------- -------
AB ABCDE # 123        123
ABCDE# 123            123
AB: ABC# 123          123
AB # 1 000 000        1000000
AB # 1`234`567        1234567
AB # (9)(876)(543)    9876543

For update, add ColToUpdate to select list of the data cte:

;with num as (...),
data as (
    select ColToUpdate, /*DirtyCol, */Col
    from ...
)
update data
set ColToUpdate = cast(Col as int)
i-one
  • 5,050
  • 1
  • 28
  • 40
  • 1
    Thank you! This should be the accepted answer. You can make this into an inline function and it'll perform faster than the other functions mentioned here. – Gabe Jun 25 '20 at 18:12
  • Good answer. I think it should be updated to use 'len' instead of 'datalength' in 'N<=datalength(DirtyCol)' because a nvarchar column will have two bytes per character. – ubergeek Feb 28 '22 at 22:55
  • @ubergeek why? What I'm missing here? – Henryk Budzinski Dec 09 '22 at 00:10
10

Pretty late to the party, I found the following which I though worked brilliantialy.. if anyone is still looking

SELECT
    (SELECT CAST(CAST((
        SELECT SUBSTRING(FieldToStrip, Number, 1)
        FROM master..spt_values
        WHERE Type='p' AND Number <= LEN(FieldToStrip) AND
            SUBSTRING(FieldToStrip, Number, 1) LIKE '[0-9]' FOR XML Path(''))
    AS xml) AS varchar(MAX)))
FROM
    SourceTable
Harry
  • 2,636
  • 1
  • 17
  • 29
7
CREATE FUNCTION FN_RemoveNonNumeric (@Input NVARCHAR(512))
RETURNS NVARCHAR(512)
AS
BEGIN
DECLARE @Trimmed NVARCHAR(512)

SELECT @Trimmed = @Input

WHILE PATINDEX('%[^0-9]%', @Trimmed) > 0
    SELECT @Trimmed = REPLACE(@Trimmed, SUBSTRING(@Trimmed, PATINDEX('%[^0-9]%', @Trimmed), 1), '')

RETURN @Trimmed
END

GO

SELECT dbo.FN_RemoveNonNumeric('ABCDE# 123')
Elmer C
  • 71
  • 1
  • 3
3

Here's a version which pulls all digits from a string; i.e. given I'm 35 years old; I was born in 1982. The average family has 2.4 children. this would return 35198224. i.e. it's good where you've got numeric data which may have been formatted as a code (e.g. #123,456,789 / 123-00005), but isn't appropriate if you're looking to pull out specific numbers (i.e. as opposed to digits / just the numeric characters) from the text. Also it only handles digits; so won't return negative signs (-) or periods .).

declare @table table (id bigint not null identity (1,1), data nvarchar(max)) 
insert @table (data) 
values ('hello 123 its 45613 then') --outputs: 12345613
,('1 some other string 98 example 4') --outputs: 1984
,('AB ABCDE # 123') --outputs: 123 
,('ABCDE# 123') --outputs: 123
,('AB: ABC# 123') --outputs: 123
; with NonNumerics as (
    select id
    , data original
    --the below line replaces all digits with blanks
    , replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(data,'0',''),'1',''),'2',''),'3',''),'4',''),'5',''),'6',''),'7',''),'8',''),'9','') nonNumeric
    from @table
)
--each iteration of the below CTE removes another non-numeric character from the original string, putting the result into the numerics column
, Numerics as (
    select id
    , replace(original, substring(nonNumeric,1,1), '') numerics
    , replace(nonNumeric, substring(nonNumeric,1,1), '') charsToreplace
    , len(replace(nonNumeric, substring(nonNumeric,1,1), '')) charsRemaining
    from NonNumerics

    union all

    select id
    , replace(numerics, substring(charsToreplace,1,1), '') numerics
    , replace(charsToreplace, substring(charsToreplace,1,1), '') charsToreplace
    , len(replace(charsToreplace, substring(charsToreplace,1,1), '')) charsRemaining
    from Numerics
    where charsRemaining > 0
)
--we select only those strings with `charsRemaining=0`; i.e. the rows for which all non-numeric characters have been removed; there should be 1 row returned for every 1 row in the original data set.
select * from Numerics where charsRemaining = 0

This code works by removing all the digits (i.e. the characters we want) from a the given strings by replacing them with blanks. Then it goes through the original string (which includes the digits) removing all of the characters that were left (i.e. the non-numeric characters), thus leaving only the digits.

The reason we do this in 2 steps, rather than just removing all non-numeric characters in the first place is there are only 10 digits, whilst there are a huge number of possible characters; so replacing that small list is relatively fast; then gives us a list of those non-numeric characters which actually exist in the string, so we can then replace that small set.

The method makes use of recursive SQL, using common table expressions (CTEs).

JohnLBevan
  • 22,735
  • 13
  • 96
  • 178
2

To add on to Ken's answer, this handles commas and spaces and parentheses

--Handles parentheses, commas, spaces, hyphens..
declare @table table (c varchar(256))
insert into @table
values
('This is a test 111-222-3344'),
('Some Sample Text (111)-222-3344'),
('Hello there 111222 3344 / How are you?'),
('Hello there 111 222 3344 ? How are you?'),
('Hello there 111 222 3344. How are you?')

select
replace(LEFT(SUBSTRING(replace(replace(replace(replace(replace(c,'(',''),')',''),'-',''),' ',''),',',''), PATINDEX('%[0-9.-]%', replace(replace(replace(replace(replace(c,'(',''),')',''),'-',''),' ',''),',','')), 8000),
           PATINDEX('%[^0-9.-]%', SUBSTRING(replace(replace(replace(replace(replace(c,'(',''),')',''),'-',''),' ',''),',',''), PATINDEX('%[0-9.-]%', replace(replace(replace(replace(replace(c,'(',''),')',''),'-',''),' ',''),',','')), 8000) + 'X') -1),'.','')
from @table
S3S
  • 24,809
  • 5
  • 26
  • 45
2

Use this:

REPLACE(TRANSLATE(SomeString, REPLACE(TRANSLATE(SomeString, '0123456789', '##########'), '#', ''), REPLICATE('#', LEN(REPLACE(TRANSLATE(SomeString, '0123456789', '##########'), '#', '') + 'x') - 1)), '#', '')

Demo:

DROP TABLE IF EXISTS #MyTempTable;

CREATE TABLE #MyTempTable (SomeString VARCHAR(255));

INSERT INTO #MyTempTable
VALUES ('ssss123ssg99d362sdg')
    , ('hey 62q&*^(n43')
    , (NULL)
    , ('')
    , ('hi')
    , ('123');

SELECT SomeString
    , REPLACE(TRANSLATE(SomeString, REPLACE(TRANSLATE(SomeString, '0123456789', '##########'), '#', ''), REPLICATE('#', LEN(REPLACE(TRANSLATE(SomeString, '0123456789', '##########'), '#', '') + 'x') - 1)), '#', '')
FROM #MyTempTable;

DROP TABLE IF EXISTS #MyTempTable;

Results:

SomeString (No column name)
ssss123ssg99d362sdg 12399362
hey62q&*^(n43 6243
NULL NULL
hi
123 123
Jesse Hufstetler
  • 583
  • 7
  • 13
  • Excellent. Had lost hope after scrolling through the rest of these that wouldn't work for me for one reason or another. But this looks like it will do the trick brilliantly. – Joel Roberts Jul 27 '23 at 22:30
1
Create function fn_GetNumbersOnly(@pn varchar(100))
    Returns varchar(max)
    AS
    BEGIN
      Declare @r varchar(max) ='', @len int ,@c char(1), @x int = 0
      Select @len = len(@pn)
      while @x <= @len 
      begin
        Select @c = SUBSTRING(@pn,@x,1)
        if ISNUMERIC(@c) = 1 and @c <> '-'
         Select @r = @r + @c
       Select @x = @x +1
      end
    return @r
End
Patrick
  • 5,526
  • 14
  • 64
  • 101
eDriven_Levar
  • 361
  • 3
  • 10
1

In your case It seems like the # will always be after teh # symbol so using CHARINDEX() with LTRIM() and RTRIM() would probably perform the best. But here is an interesting method of getting rid of ANY non digit. It utilizes a tally table and table of digits to limit which characters are accepted then XML technique to concatenate back to a single string without the non-numeric characters. The neat thing about this technique is it could be expanded to included ANY Allowed characters and strip out anything that is not allowed.

DECLARE @ExampleData AS TABLE (Col VARCHAR(100))
INSERT INTO @ExampleData (Col) VALUES ('AB ABCDE # 123'),('ABCDE# 123'),('AB: ABC# 123')

DECLARE @Digits AS TABLE (D CHAR(1))
INSERT INTO @Digits (D) VALUES ('0'),('1'),('2'),('3'),('4'),('5'),('6'),('7'),('8'),('9')

;WITH cteTally AS (
SELECT
    I = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM
    @Digits d10
    CROSS APPLY @Digits d100
    --add more cross applies to cover longer fields this handles 100
)

SELECT *
FROM
    @ExampleData e
    OUTER APPLY (
    SELECT CleansedPhone = CAST((
    SELECT TOP 100
       SUBSTRING(e.Col,t.I,1)
    FROM
       cteTally t
       INNER JOIN @Digits d
       ON SUBSTRING(e.Col,t.I,1) = d.D
    WHERE
       I <= LEN(e.Col)
    ORDER BY
       t.I
    FOR XML PATH('')) AS VARCHAR(100))) o
Matt
  • 13,833
  • 2
  • 16
  • 28
1

Out of all of these solutions, none seemed to take out "ALL" non digit characters. But the method by Clement, as modified below, seems the most reliable, and works with ALL readably available US keyboard chars...

DECLARE @textval NVARCHAR(300)
SET @textval = '1''2A2B2C2d2e2f2~2!2@2#2$2%2^2&2*2(2)2_2+2`2-2=2{2}2|2[2]2\2:2"2;2<2>2?2,2.2/2 3'
SELECT REPLACE(TRANSLATE(@textval, ''' abcdefghijklmnopqrstuvwxyz~!@#$%^&*()_+`-={}|[]\:";<>?,./', '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'), '@', '')

...produces 1222222222222222222222222222222222222223

kamgman
  • 744
  • 6
  • 9
  • Is this a solution or a comment on Clement's answer? – Suraj Rao Mar 01 '23 at 10:16
  • It is a comment on Clement's solution: {none seemed to take out "ALL" non-digit characters}, followed by my solution that does take out all non-digit characters. – kamgman Mar 02 '23 at 19:28
0
 Declare @MainTable table(id int identity(1,1),TextField varchar(100))
  INSERT INTO @MainTable (TextField)
 VALUES
 ('6B32E')
 declare @i int=1
  Declare @originalWord varchar(100)=''
  WHile @i<=(Select count(*) from @MainTable)
  BEGIN
  Select @originalWord=TextField from @MainTable where id=@i

 Declare @r varchar(max) ='', @len int ,@c char(1), @x int = 0

    Select @len = len(@originalWord)
    declare @pn varchar(100)=@originalWord
    while @x <= @len 
    begin

      Select @c = SUBSTRING(@pn,@x,1)
    if(@c!='')
    BEGIN
            if ISNUMERIC(@c) = 0 and @c <> '-'
    BEGIN
     Select @r = cast(@r as varchar) + cast(replace((SELECT ASCII(@c)-64),'-','') as varchar)

   end
   ELSE
   BEGIN
    Select @r = @r + @c


   END

END


    Select @x = @x +1

    END
    Select @r
  Set @i=@i+1
  END
0

I have created a function for this

Create FUNCTION RemoveCharacters (@text varchar(30))
RETURNS VARCHAR(30)
AS
BEGIN
declare @index as int 
declare @newtexval as varchar(30)
set @index = (select PATINDEX('%[A-Z.-/?]%', @text))
if (@index =0)
begin 
return @text
end
else
begin 
set @newtexval  = (select STUFF ( @text , @index , 1 , '' ))
return dbo.RemoveCharacters(@newtexval)
end
return 0
END
GO
0

Here is the answer:

DECLARE @t TABLE (tVal VARCHAR(100))

INSERT INTO @t VALUES('123')
INSERT INTO @t VALUES('123S')
INSERT INTO @t VALUES('A123,123')
INSERT INTO @t VALUES('a123..A123')


;WITH cte (original, tVal, n)
     AS
     (
         SELECT t.tVal AS original,
                LOWER(t.tVal)  AS tVal,
                65             AS n
         FROM   @t             AS t
         UNION ALL
         SELECT tVal AS original,
                CAST(REPLACE(LOWER(tVal), LOWER(CHAR(n)), '') AS VARCHAR(100)),
                n + 1
         FROM   cte
         WHERE  n <= 90
     )

SELECT t1.tVal  AS OldVal,
       t.tval   AS NewVal
FROM   (
           SELECT original,
                  tVal,
                  ROW_NUMBER() OVER(PARTITION BY tVal + original ORDER BY original) AS Sl
           FROM   cte
           WHERE  PATINDEX('%[a-z]%', tVal) = 0
       ) t
       INNER JOIN @t t1
            ON  t.original = t1.tVal
WHERE  t.sl = 1
Khorshed Alam
  • 314
  • 2
  • 11
0

You can create SQL CLR scalar function in order to be able to use regular expressions like replace patterns.

Here you can find example of how to create such function.

Having such function will solve the issue with just the following lines:

SELECT [dbo].[fn_Utils_RegexReplace] ('AB ABCDE # 123', '[^0-9]', '');
SELECT [dbo].[fn_Utils_RegexReplace] ('ABCDE# 123', '[^0-9]', '');
SELECT [dbo].[fn_Utils_RegexReplace] ('AB: ABC# 123', '[^0-9]', '');

More important, you will be able to solve more complex issues as the regular expressions will bring a whole new world of options directly in your T-SQL statements.

gotqn
  • 42,737
  • 46
  • 157
  • 243
0

While the OP wanted to "strip out anything that is not 0-9", the post is also tagged with "substring" and "patindex", and the OP mentioned the concern "not quite getting the syntax correct or something". To address that the requirements note that "all data fields do have the # prior to the number" and to provide an answer that addresses the challenges with substring/patindex, consider the following:

/* A sample select */
;WITH SampleValues AS
( SELECT 'AB ABCDE # 123' [Columnofdirtydata]
    UNION ALL SELECT 'AB2: ABC# 123')
SELECT
s.Columnofdirtydata,
f1.pos1,
'['+ f2.substr +']' [InspectOutput]
FROM
SampleValues s
CROSS APPLY (SELECT PATINDEX('%# %',s.Columnofdirtydata) [pos1]) f1
CROSS APPLY (SELECT SUBSTRING(s.Columnofdirtydata, f1.pos1 + LEN('#-'),LEN(s.Columnofdirtydata)) [substr]) f2

/* Using update scenario from OP */
UPDATE t1
SET t1.Columntoupdate = CAST(f2.substr AS INT)
FROM
TableName t1
CROSS APPLY (SELECT PATINDEX('%# %',t1.Columnofdirtydata) [pos1]) f1
CROSS APPLY (SELECT SUBSTRING(t1.Columnofdirtydata, f1.pos1 + LEN('#-'),LEN(t1.Columnofdirtydata)) [substr]) f2

Note that my syntax advice for patindex/substring, is to:

  • consider using APPLY as a way to temporarily alias results from one function for use as parameters in the next. It's not uncommon to (in ETL, for example) need to parse out parameter/position-based substrings in an updatable column of a staging table. If you need to "debug" and potentially fix some parsing logic, this style will help.
  • consider using LEN('PatternSample') in your substring logic, to account for reusing this pattern or adjusting it when your source data changes (instead of "+ 1"
  • SUBSTRING() requires a length parameter, but it can be greater than the length of the string. Therefore, if you are getting "the rest of the string" after the pattern, you can just use "The source length"
Sean
  • 91
  • 4
-1

DECLARE @STR VARCHAR(400)

DECLARE @specialchars VARCHAR(50) = '%[~,@,#,$,%,&,*,(,),!^?:]%'

SET @STR = '1, 45 4,3 68.00-'

WHILE PATINDEX( @specialchars, @STR ) > 0

---Remove special characters using Replace function

SET @STR = Replace(Replace(REPLACE( @STR, SUBSTRING( @STR, PATINDEX( @specialchars, @STR ), 1 ),''),'-',''), ' ','')

SELECT @STR

-1
SELECT REGEXP_REPLACE( col, '[^[:digit:]]', '' ) AS new_col FROM my_table
Suraj Rao
  • 29,388
  • 11
  • 94
  • 103
Tamar
  • 1
  • 4
    While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Yunnosch May 02 '22 at 08:05
  • 1
    The question has SQL Server tagged, this is Oracle – Jesse Hufstetler Sep 12 '22 at 17:08