MS SQL -- How to Extract Phone Number From Long Inconsistent String

Question

I have a task to complete where I have to import around 970,000 users from someone else's database into ours. In the source DB, there is a phone number field that contains multiple phone numbers concatenated into one hideous string.

Here are some data examples:

|Home: 555-555-5555 Office: (555)-555-5555 Work: 5555555555|
|Home: Office: 555\555-5555 Work: 555-555-5555|
|Office: 555-555-5555 Home: (555)555-5555 some Comment here|

The problems that I'm running into is

The order of the numbers is inconsistent
There are some free text comments strewn about
Some of the phone numbers have different formatting.

I would really prefer to do this through SQL if possible and am pretty stumped on doing this in an efficient manner with minimal Manual adjustment.

In my DB, we have separate columns for each phone number type, so I basically need to split those strings into their appropriate columns.

Please tell me if I left anything out.

I'd not do it in SQL. I'd make an application that pulls out the data, sanitizes it and stores it in a correct model. Trying to solve it in SQL only would take as much time as it would be to do it "easy". — Allan S. Hansen, May 04 '16 at 07:04
I totally agree with Allan. This is going to be a terrible pain in the you-know-where to do in sql. — Zohar Peled, May 04 '16 at 07:07
Agreed. Pick the right tool for the job. This job requires strong text manipulation features, and it's generally agreed that T-SQL's text manipulation tools are among the poorest around. — Damien_The_Unbeliever, May 04 '16 at 07:18
Thanks guys. I'm definitely agreeing on the application route, just really wanted to see if there was a way to do it in SQL. — Sage, May 04 '16 at 08:15

t-clausen.dk · Accepted Answer · 2016-05-04T11:16:49.407

In this code, you will need 3 extra columns to store the new phone numbers

This is the logic in the code

splitting the phone numbers

cutting off text before first 3 last 3 numerics in the split result

deleting the alien characters used in phone number(only those used in sample)

inserting replacement separators '-' at position 7 and 4

grouping up data

updating table

Sample data:

DECLARE @t table
  (phone varchar(500), home varchar(50), work varchar(50), office varchar(50))
INSERT @t(phone) values
('Home: 555-555-5551 Office: (555)-555-5555 Work: 5555555552|'),
('|Home: Office: 555\555-5555 Work: 555-555-5555|'),
('|Office: 555-555-5555 Home: (555)555-5555 some Comment here|')

Update:

;WITH CTE as
(
     SELECT
       nid,work, home, office, 
       t.c.value('.', 'VARCHAR(2000)') phone
     FROM (
         SELECT
           row_number() over(order by (select 1)) nid, work, home,office, 
           x = CAST('<t>' + 
               REPLACE(REPLACE(REPLACE(phone, 'Work', '</t><t>work')
               ,'Office', '</t><t>Office'), 'Home', '</t><t>Home')
                + '</t>' AS XML)
        FROM @t -- replace @t with your table
     ) a
     CROSS APPLY x.nodes('/t') t(c)
     WHERE t.c.value('.', 'VARCHAR(2000)') like '%[0-9][0-9][0-9]%'
), CTE2 as
(
SELECT 
work,max(case when phone like '%work%' then z end) over(partition by nid)nwork,
home,max(case when phone like '%home%' then z end) over(partition by nid)nhome,
office,max(case when phone like '%office%' then z end) over(partition by nid)noffice
FROM cte t
CROSS APPLY(SELECT REVERSE(SUBSTRING(phone,PATINDEX('%[0-9][0-9][0-9]%', phone), 20))x)y
CROSS APPLY(SELECT STUFF(STUFF(REPLACE(REPLACE(REPLACE(REVERSE(
  SUBSTRING(x, PATINDEX('%[0-9][0-9][0-9]%', x), 20)), ')', ''), '\', ''),
  '-', ''),7,0, '-'),4,0,'-')z)v    )
UPDATE CTE2
SET work = nwork, home = nhome, office = noffice

SELECT home,work,office FROM @t

Result:

home          work          office
555-555-5551  555-555-5552  555-555-5555
NULL          555-555-5555  555-555-5555
555-555-5555  NULL          555-555-5555

From the OP - "The order of the numbers is inconsistent" - from this (and the third example), I take it they want to know whether each number is the home, work or office number, specifically. — Damien_The_Unbeliever, May 04 '16 at 07:51
@Damien_The_Unbeliever thanks, I missed that, will try to fix it — t-clausen.dk, May 04 '16 at 07:53
I tried this on over 23000 rows of live data. It took about 5 minutes, but ran successfully and the data is intact. Many thanks, sir. I hope that this helps many others in the future. — Sage, May 09 '16 at 15:14

score 1 · Answer 2 · answered May 04 '16 at 07:38

1

Using T-SQL for this task would not be the best choice. The closest solution would be to create a CLR assembly which would utilise the RegEx functionality available in .NET.

However, you may also look into Data Quality Services. It is a SQL Server component which was created for this particular type of task - manual entry cleansing, unification, de-duplication, etc. It requires BI or Enterprise edtion of SQL Server, however.

answered May 04 '16 at 07:38

Roger Wolf

7,307
2
24
33

Thanks for the tip. I'm using SQL Server 2012 Enterprise so I should be able to use DQS for what I need. Do you think it would be faster to just write a small console app to parse these or use DQS? – Sage May 04 '16 at 09:11
@Sage, I think the answer will very much depend on the complexity of your data, and how many different formats you may encounter. But yes, DQS might have a rather steep learning curve. Also, it would depend on whether it's a one-off run or continuous data flow that needs to be maintained. So it's really up to you. – Roger Wolf May 04 '16 at 11:45

MS SQL -- How to Extract Phone Number From Long Inconsistent String

2 Answers2

Linked