0

I'm trying to perform some data cleansing in SQL. I have a master data table which has a CASE statement for each of the incorrect entries like this

CASE when location LIKE '%gov%' THEN REPLACE <gov with govt>
     when location LIKE '%primery%' THEN REPLACE <primery with primary>

E.g.: I need 'gov primery skool' to become 'Government primary school'. There are around 40 such words that need to be updated with the correct entries. So doing it for each and every combination of these words wouldn't scale up.

I have done some research and found that SQL does not allow a fall through case statement which would have helped in this scenario. Any ideas on how this can be implemented?

TT.
  • 15,774
  • 6
  • 47
  • 88
Abishek
  • 33
  • 6
  • It doesn't seem like you really need a fallthrough switch-case, but if you share the table structure(s) and some sample data, it would be much easier to help you. – Mureinik Oct 18 '16 at 06:24
  • 2
    If you have a table containing two columns - `wrongvalue, rightvalue` then you should be able to craft an `UPDATE` statement that joins this table to the one to be corrected and updates the entire *set* of data. I've no idea where case statements or fall through even come into this. – Damien_The_Unbeliever Oct 18 '16 at 06:24
  • the query is dynamic in nature. So I have a CASE statement located in the master table right now like this which gets picked up based on a condition - CASE WHEN location like '%gov%' THEN REPLACE(location,'gov','govt') I have a WHEN clause for each of the incorrect entries This might not be the best implementation so any workarounds would be appreciated – Abishek Oct 18 '16 at 06:42
  • You'll probably end up with the following: `REPLACE(REPLACE(location,'primery','primary'),'gov','govt')`. That's for two replacements. For three eg also replace `repubs` with `republicans`: `REPLACE(REPLACE(REPLACE(location,'repubs','republicans'),'primery','primary'),'gov','govt')`. And so on... – TT. Oct 18 '16 at 10:29
  • In followup to my previous comment, please refer to [this answer](http://stackoverflow.com/a/35131098/243373) I gave on the question `How can I replace all key fields in a string with replacement values from a table in T-SQL`. The answer shows what ways there are to dynamically build a statement to do all replacements. HTH. – TT. Oct 18 '16 at 11:25
  • @TT. How well would that scale with thousands of possible replacements on thousands of source values, as it appears your solutions try each one on every input string? – iamdave Oct 19 '16 at 11:58
  • @iamdave Hi Dave. For one thing, the way the OP wrote it with CASE ... won't do all replacements. If the string has `gov` and `primery` in it, only the first will be replaced. So you are looking for a construct that does all replacements. The only way I know of to do this efficiently is the answer I gave you (I prefer 1.A.). I have not tested this for thousands of replacements, but also there is no other way than to do REPLACE statements for each substring you want to replace. – TT. Oct 19 '16 at 12:54
  • @iamdave The "problem", if you could call it that, is that recursion is not very effecient in terms of execution time and memory consumption. For small numbers of replacements (eg < 1000) I don't see a problem though. For large numbers of replacements (eg > 10000) I'm thinking you might see performance degradation due to recursion. OTOH, the dynamic `REPLACE(...)` from my answer might run into a maximum number bytes for a SQL query (65,536 * Network Packet size) (cf http://stackoverflow.com/q/1869753/243373). – TT. Oct 19 '16 at 12:58
  • @iamdave I suppose it's all in the testing. Your solution seems good though. – TT. Oct 19 '16 at 13:03

3 Answers3

1

I think the best bet you have here is to maintain a table of misspellings that you want to find and replace within your data. This can either be done with a static reference table (recommended) or if you can't add tables, then a table variable within a procedure will suffice.

With this list of individual words, you can then loop through your data with a recursive cte to replace each misspelling in turn and return the final result. With this approach, you only need to maintain the list of individual words and not entire sentences as per some of the other answers proposed. You will need to insert your data into a temp table so as to retain the row_numbers and counts in the same order, in case you have duplicate rows.

Output from query below:

TextValue             | NewTextValue
----------------------+----------------------
gov primery skool     | govt primary school
Not so incorect name  | Not so incorrect name

-- Create data to use.
declare @Data table (TextValue nvarchar(50));
declare @Lookup table (IncorrectValue nvarchar(50),CorrectValue nvarchar(50));

insert into @Data values
 ('gov primery skool')
,('Not so incorect name');

insert into @Lookup values
 ('gov','govt')
,('primery','primary')
,('skool','school')
,('incorect','incorrect');



-- Actual query.
if object_id('tempdb..#DataRN') is not null
drop table #DataRN;

create table #DataRN (rn int
                        ,cnt int
                        ,cntrn int
                        ,TextValue nvarchar(50)
                        ,IncorrectValue nvarchar(50)
                        ,CorrectValue nvarchar(50)
                        );

-- Create temp table that holds source data as well as some Row Numbers and Counts by TextValue to determine what order to retreive rows and to know which is the final row to pull out at the end.
insert into #DataRN
select row_number() over (order by d.TextValue, l.IncorrectValue, l.CorrectValue) as rn     -- Provides order to select rows in cte below.
    ,row_number() over (partition by d.TextValue
                        order by d.TextValue, l.IncorrectValue, l.CorrectValue) as cntrn    -- Provides ordering within each TextValue, to be compared with 
    ,count(*) over (partition by d.TextValue) as cnt                                        -- The total number of rows returned to complete all REPLACEs on the TextValue to find the last, and therefore complete, NewTextValue in cte.
    ,d.TextValue
    ,l.IncorrectValue
    ,l.CorrectValue
from @Data d
        left join @Lookup l
            on(d.TextValue like '%' + l.IncorrectValue + '%');


-- Recursive cte to apply each find and replace in order.
with cte as
(
    select d.rn
            ,d.cnt
            ,d.cntrn
            ,d.TextValue
            ,cast(replace(d.TextValue,d.IncorrectValue,d.CorrectValue) as nvarchar(50)) as NewTextValue
    from #DataRN d
    where rn = 1

    union all

    select d.rn
            ,d.cnt
            ,d.cntrn
            ,d.TextValue
            ,cast(replace(case when d.TextValue = c.TextValue
                                then c.NewTextValue
                                else d.TextValue
                                end
                            ,d.IncorrectValue
                            ,d.CorrectValue
                            ) as nvarchar(50)) as NewTextValue
    from #DataRN d
        inner join cte c
            on(d.rn = c.rn+1)
) 
select TextValue
        ,NewTextValue
from cte
where cnt = cntrn   -- Where these two values are the same, we know that is the finished product.
order by 1;




if object_id('tempdb..#DataRN') is not null
drop table #DataRN;
iamdave
  • 12,023
  • 3
  • 24
  • 53
0

You need update statement with CASE clause.

UPDATE YourTabel  
SET     YourField=  CASE  
                        WHEN YourField= 'gov primery skool' THEN 'Government primary school' 
                        WHEN YourField= 'ABC' THEN 'XYZ'
                        ELSE YourField
                    END 
Vicky_Burnwal
  • 951
  • 9
  • 14
  • the field could also be 'gov skool primery'. There are various permutations for these entries. Ideally I would like to check for a word in that string and then replace. Essentially look for gov and replace by govt and then look for "primery" and then replace and so on in the same string. – Abishek Oct 18 '16 at 06:47
  • In that scenario use Replace function with '%'. – Vicky_Burnwal Oct 18 '16 at 06:57
  • I did try that. I used '%primery'. The issue is once SQL replaces this it does not proceed to check the next check in the case like '%skool' – Abishek Oct 18 '16 at 06:59
0

You can use OR condition in case statement.

    update TableName
    set Field = case when fieldname='gov primery skool' or fieldname='gov skool primery' then 'Government primary school'
    else fieldname end