1

I need to build a function to replace multiple substrings for all rows in a table.

Performance is not a big concern, as this is a one-time operation, but there are 48 mappings and roughly 30,000 rows. I know looping over the whole database 48 times is quite stupid, but SQL is not my wheelhouse. If this were Java or C++, it'd be cake.

Basically, I need the SQL analog of the following function. If SQL can't short-circuit loops, that's fine. I've seen the SQL replace function, but encapsulating it properly in a user-defined function is my major stumbling block.

I'm using Microsoft SQL Server if that produces any particular quirks.

mapping[] maps = { {" st ", " Street "}, {" st. ", " Street "}, ...};

for(row r : table) {
    String orig = r.data(colName);
    for(mapping m : maps) {
        r.data(colName).replace(m.first, m.second);
        if(r.data(colName) != orig)
            break;
    }
}
patrickjp93
  • 399
  • 4
  • 20
  • Are the things you'd have in maps constant (i.e. literal), or something dynamic, like from another table? – hatchet - done with SOverflow Jul 19 '17 at 23:22
  • What schema are you trying to update? Can you include the table definitions and some sample data before/after? There's not enough information in here to really answer your question. – pmbAustin Jul 19 '17 at 23:23
  • why do you need to do this in "better" sql if you don't care about performance -- you have a loop you know does exactly what you want -- just use it. – Hogan Jul 19 '17 at 23:25
  • Without knowing more, this can be done by simply chaining calls to `REPLACE`, ex) `REPLACE(REPLACE(colName, ' st. ', ' Street '), ' st ', ' Street ')` – hatchet - done with SOverflow Jul 19 '17 at 23:26
  • @hatchet -- clearly that does not do what the OP is asking for. The code specification is right there! – Hogan Jul 19 '17 at 23:27
  • Hi Hatchet, constant literals. We have messy address data, to put it shortly. I'm cleaning it up. Av/Ave/Ave. goes to Avenue. 35 Rosemont Ringoes Rd goes to 35 Rosemont Ringoes Road. Everything is an nvarchar. Also, I am an HPC guy, so I have a code of ethics about building code that's not crap. If someone can easily draw up the best version, I'd like that one, but I'll settle for good since I'm not an SQL guy to start with. Also, nested function calls for 48 mappings is unmaintainable. So, really? Not to be ungrateful, but that's just bad no matter what language you're using. – patrickjp93 Jul 19 '17 at 23:32
  • I did exactly what you're doing, and just used the chained REPLACE as in my comment above. It doesn't follow your code, because it doesn't quit when it makes the first replacement, but since we were standardizing multiple sorts of address elements (west/w, street/st, trailer/tlr, etc.), that was desired. It works great, and is quite fast...just a little ugly to look at. – hatchet - done with SOverflow Jul 19 '17 at 23:36
  • 1
    @hatchet - I don't think hard coding it with multiple replace is what he was asking for -- an you didn't either -- that is why you put in a comment and not an answer. – Hogan Jul 19 '17 at 23:40
  • Could you put that in a formal answer and show it as a complete solution (i.e. I can highlight it and run it with F5)? – patrickjp93 Jul 19 '17 at 23:40
  • @patrickjp93 - I just posted an answer to a similar question after finding a different approach that is cleaner than nested REPLACE, and faster than recursive CTE. I don't want to double post the same answer, but you can see it here: https://stackoverflow.com/questions/45289325/efficient-cleaning-of-strings-in-a-table/45291498#45291498 – hatchet - done with SOverflow Jul 24 '17 at 23:19

2 Answers2

2

@Hogan has the right idea. This syntax should be closer to working:

WITH map as (
      SELECT v.*
      FROM (VALUES (' st ', ' Street ', 1),
                   (' st. ', ' Street ', 2)
           ) v(str, repstr, n)
     ),
     cte as (
      SELECT replace(t.field, map.str, map.repstr) as field, map.n as n
      FROM t JOIN
           map
           ON map.n = 1
      UNION ALL
      SELECT replace(cte.field, map.str, map.repstr) as field, map.n + 1
      FROM cte JOIN
           map
           ON map.n = cte.n + 1
     )
SELECT field 
FROM (SELECT cte.*, MAX(cte.n) OVER (PARTITION BY cte.field) as maxn
      FROM cte
     ) x
WHERE n = maxn;

You may want to include more fields in the CTE from the original table.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1
CREATE FUNCTION [dbo].[StandardizeAddress](@address varchar(123))
RETURNS varchar(250)
WITH SCHEMABINDING
AS
BEGIN
    RETURN
        REPLACE(REPLACE(
                @address + ' '
                , ' st ', ' Street')
                , ' st. ', ' Street ')
END

Creating a scalar function like this is how we did this. Using the code above to compute addresses from a table of 171,000 rows took 240 ms. Using our actual function, which has more than 80 substitutions, and does some other manipulations takes 5 seconds for 171,000 rows. However, we store the standardized version of addresses, because we are doing complex person searches and precompute standardized values for performance sake. So the function is only run once when a row is added or an address modified, so the speed of this function is not an issue.

For comparison, Gordon's solution takes 4.5 seconds against the same dataset (vs. 240 ms for chained REPLACE). With 4 replacements instead of two, the CTE solution takes 7.8 seconds, vs. 275 ms for REPLACE.

I need to add a caveat that there is a limit to how many function calls can be nested. According to another question on stackOverflow, the limit is 244, which is a fair amount larger than the default max recursion limit for recursive CTEs.

Another option that's a bit slower (about 75% more time) than nested REPLACE functions is something like the following:

select c3.address from (select REPLACE(@address, ' st ', ' Street ') address) c1
        cross apply (select REPLACE(c1.address, ' st. ', ' Street ') address) c2
        cross apply (select REPLACE(c2.address, ' dr ', ' Drive ') address) c3

But I don't see much advantage for that. You could also write a c# CLR function, but I suspect that the calling overhead might make it slower than just using the nested REPLACE calls.

Edit- Since posting this I posted an answer to a similar question that seems to be in the speed ballpark of nested REPLACE, but is much cleaner (and more maintainable) code.

  • Since speed isn't much an issue wouldn't a solution with a map table and a recursive clean up be much easier to maintain? – Hogan Jul 20 '17 at 00:15
  • The version we have is pretty easy to maintain. And we don't have to worry about the recursion limit. Most developers understand REPLACE. Not as many understand recursive Common Table Expressions. – hatchet - done with SOverflow Jul 20 '17 at 00:23
  • That's right SQL Server has a recursion limit. You don't need people to look at the CTE -- just a mapping table with `replace_target` and `replace_with` columns. – Hogan Jul 20 '17 at 01:59
  • Well, the REPLACE version is an order of magnitude faster with a small number of mappings (and possibly orders of magnitude with a large set of mappings), is easy to maintain, easy to understand, and works. That's good enough for me. – hatchet - done with SOverflow Jul 20 '17 at 02:46
  • and when you change the final select to `SELECT TOP 1 field FROM cte ORDER BY n DESC` from Gordon's solution? – Hogan Jul 20 '17 at 21:55
  • @Hogan - it takes about still takes over 7 seconds with 4 replacements, but returns just 1 row, instead of returning all 171,000 converted rows. – hatchet - done with SOverflow Jul 20 '17 at 22:32