
I'm trying to replace anything between 2 specific characters in a string that contains multiple occurrences of those 2 characters. Think of it as a CSV format.

Here is an example of the data I have in that field:

0001, ABCD1234;0002, EFGH432562;0003, IJKL1345hsth;...

What I need to retrieve is every part before the ',', but not what is between the ',' and the ';'.

I tried these expressions, but with no success:

 SELECT REPLACE(fieldname, ',[A-Z];', ' ') FROM ...
 or
 SELECT REPLACE(fieldname, ',*;', ' ') FROM ...

I need to get

0001 0002 0003

Is there a way to achieve that?

Eric Brochu
  • What is the expected result? – Zhorov Jan 11 '22 at 14:43
  • so from your sample data what is the expected outcome ? – GuidoG Jan 11 '22 at 14:43
  • The *real* question is why are you storing delimited data in your database in the first place. `REPLACE`, however, doesn't support patterns, only literals. There is no pattern or regex replacement functionality built into T-SQL. – Thom A Jan 11 '22 at 14:44
  • What you need to do is fix the data. SQL is a *query* language, not a text manipulation language. Even if you could parse that string you wouldn't be able to query the data efficiently because it can't be indexed. – Panagiotis Kanavos Jan 11 '22 at 14:51
  • `I need to get ... Is there a way to achieve that?` none of the options involves wildcards and replacing. 1) Parse the data before inserting it in the database, and use proper tables. That's the fastest and most flexible option, allowing you to index the fields and easily query the data 2) Store the data as XML or JSON and use XML/JSON functions to parse it. You won't benefit from indexes but at least you could write a proper query 3) Read and parse the data on the client. – Panagiotis Kanavos Jan 11 '22 at 14:53
  • Thx guys, imma talk about it to the dev team. – Eric Brochu Jan 11 '22 at 14:53
  • Even if what you ask could somehow be done, it would be *very* slow - without indexes, the server has to read every single row in the table and parse it. Since SQL isn't a text manipulation language it's *very* weak when it comes to text processing. Even if SQL Server had regular expressions (it doesn't) your query would still have to process the entire table. – Panagiotis Kanavos Jan 11 '22 at 14:58
  • On the other hand, you can have up to 30K columns in a table using [sparse columns](https://learn.microsoft.com/en-us/sql/relational-databases/tables/use-sparse-columns?view=sql-server-ver15). Using them would allow you to use proper SQL and even index specific columns. Even if you stored the "fields" as JSON you could use computed columns to extract specific values, and even index those persisted columns – Panagiotis Kanavos Jan 11 '22 at 14:58
  • While asking a question, you need to provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example): (1) DDL and sample data population, i.e. CREATE table(s) plus INSERT T-SQL statements. (2) What you need to do, i.e. logic and your code attempt implementation of it in T-SQL. (3) Desired output, based on the sample data in the #1 above. (4) Your SQL Server version (SELECT @@version;). – Yitzhak Khabinsky Jan 11 '22 at 15:02
  • BTW all supported SQL Server versions have [STRING_SPLIT](https://learn.microsoft.com/en-us/sql/t-sql/functions/string-split-transact-sql?view=sql-server-ver15) but that would still result in a very slow query. You'd have to split first by `;` and then by `,`. Worse, there's no guaranteed order or item count, unless you use Azure SQL. – Panagiotis Kanavos Jan 11 '22 at 15:05

2 Answers


You can CROSS APPLY a STRING_SPLIT and use STRING_AGG (available since SQL Server 2017) to stick the numbers back together.

select id, codes
from your_table
cross apply (
  select string_agg(left(value, patindex('%_,%', value)), ' ') as codes
  from string_split(fieldname, ';') s
  where value like '%_,%'
) ca;
GO
id | codes
-- | --------------
 1 | 0001 0002 0003

Demo on db<>fiddle here
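Not part of the original answer: the split-then-aggregate logic above can be cross-checked outside SQL. A minimal Python sketch of the same idea (split on `;`, keep what comes before the `,` in each piece, join with spaces; `extract_codes` is a made-up name):

```python
# Split on ';', keep the part before ',' in each piece (mirroring the
# LEFT(value, PATINDEX('%_,%', value)) trick), then join with spaces
# like STRING_AGG(..., ' ').
def extract_codes(fieldname: str) -> str:
    parts = fieldname.split(';')
    codes = [p.split(',')[0].strip() for p in parts if ',' in p]
    return ' '.join(codes)

print(extract_codes("0001, ABCD1234;0002, EFGH432562;0003, IJKL1345hsth"))
# -> 0001 0002 0003
```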

Extra

Here is a version that also works in SQL Server 2014, inspired by research from @AaronBertrand.
The UDF uses a recursive CTE to split the string, and the FOR XML trick is used to stick the numbers back together.

CREATE FUNCTION dbo.fnString_Split
(
    @str    nvarchar(4000), 
    @delim  nchar(1)
)
RETURNS TABLE 
WITH SCHEMABINDING 
AS 
RETURN
(
  WITH RCTE AS (
    SELECT 
      1 AS ordinal
    , ISNULL(NULLIF(CHARINDEX(@delim, @str),0), LEN(@str)) AS pos
    , LEFT(@str, ISNULL(NULLIF(CHARINDEX(@delim, @str),0)-1, LEN(@str))) AS value
    UNION ALL
    SELECT 
      ordinal+1
    , ISNULL(NULLIF(CHARINDEX(@delim, @str, pos+1), 0), LEN(@str))
    , SUBSTRING(@str, pos+1, ISNULL(NULLIF(CHARINDEX(@delim, @str, pos+1),0)-pos-1, LEN(@str)-pos )) 
    FROM RCTE
    WHERE pos < LEN(@str)
  ) 
  SELECT ordinal, value
  FROM RCTE
);
SELECT id, codes
FROM your_table
CROSS APPLY (
  SELECT RTRIM((
       SELECT LEFT(value, PATINDEX('%_,%', value))+' '
       FROM dbo.fnString_Split(fieldname, ';') AS spl
       WHERE value LIKE '%_,%'
       ORDER BY ordinal
       FOR XML PATH(''), TYPE).value(N'./text()[1]', N'nvarchar(max)')
    ) AS codes
) ca
OPTION (MAXRECURSION 250);
id | codes
-- | --------------
 1 | 0001 0002 0003

Demo on db<>fiddle here
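As an aside (my addition, not from the answer), the recursive CTE's mechanics can be sketched in Python: each step peels off the text up to the next delimiter and recurses on the remainder, carrying the ordinal so the order is explicit. The function name is invented for illustration:

```python
# Rough analogue of the recursive CTE: the anchor row takes the first
# piece; each recursive step advances past one delimiter and bumps
# the ordinal, stopping when the string is exhausted.
def split_recursive(s: str, delim: str, ordinal: int = 1):
    if not s:
        return []
    pos = s.find(delim)
    if pos == -1:
        return [(ordinal, s)]
    return [(ordinal, s[:pos])] + split_recursive(s[pos + 1:], delim, ordinal + 1)

print(split_recursive("0001, ABCD1234;0002, EFGH432562", ";"))
```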

Alternative version of the UDF (no recursion)

CREATE FUNCTION dbo.fnString_Split
(   
  @str   NVARCHAR(4000),
  @delim NCHAR(1)
)
RETURNS @tbl TABLE (ordinal INT, value NVARCHAR(4000))
WITH SCHEMABINDING
AS
BEGIN
  DECLARE @value NVARCHAR(4000)
        , @pos INT = 0
        , @ordinal INT = 0;
  WHILE (LEN(@str) > 0)
  BEGIN
    SET @ordinal += 1;
    SET @pos = ISNULL(NULLIF(CHARINDEX(@delim, @str),0), LEN(@str)+1);
    SET @value = LEFT(@str, @pos-1);
    SET @str = SUBSTRING(@str, @pos+1, LEN(@str));
    INSERT INTO @tbl (ordinal, value) 
              VALUES (@ordinal, @value);
  END;
  RETURN;
END;
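The WHILE-loop version follows the same shape; here is a hedged Python rendering of it (my addition; `split_iterative` is a made-up name), which repeatedly finds the delimiter, emits the prefix with its ordinal, and chops the string down:

```python
# Mirrors the WHILE-loop UDF: @pos falls back to LEN(@str)+1 when no
# delimiter remains (the ISNULL(NULLIF(CHARINDEX(...),0), LEN+1) idiom),
# so the final piece is emitted and the loop terminates.
def split_iterative(s: str, delim: str):
    rows, ordinal = [], 0
    while len(s) > 0:
        ordinal += 1
        pos = s.find(delim)
        pos = pos + 1 if pos != -1 else len(s) + 1  # 1-based, like CHARINDEX
        rows.append((ordinal, s[:pos - 1]))
        s = s[pos:]
    return rows

print(split_iterative("0001, A;0002, B;0003, C", ";"))
```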
LukStorms
  • very interesting, will try it – Eric Brochu Jan 11 '22 at 15:45
  • Just be aware that, while unlikely, it could come back `0002 0003 0001` or `0003 0001 0002` so, if order matters, you might need something more robust that guarantees order. If "usually works ok" is sufficient and an anomaly is not a deal-breaker, this is totally fine. – Aaron Bertrand Jan 11 '22 at 19:44
  • @AaronBertrand I know that there's this agreement that without the ordinal flag that one can't be 100% sure that string_split returns it in the same order. But I've yet to see proof that it can mess up the order in small sets. Not that it probably matters in this case. Plus the OP can include an `ORDER BY` in the cross apply. – LukStorms Jan 11 '22 at 19:52
  • I understand you might not be able to _prove_ it can come out in the wrong order, but, I mean, they invested in the enable_ordinal argument for a reason. I agree that it may not be important to the OP in this case, and I upvoted, but I still think it's important to note. Also what would they order by in the cross apply? Imagine the values are `cow, zebra, shark` and not a sequence easily sortable on its own. – Aaron Bertrand Jan 11 '22 at 19:54
  • @AaronBertrand Ok, they really should also enable that ordinal flag in Sql Server. It's a no-brainer really. I've seen solutions where they had to use things like row_number instead. Stuff like that pains the hidden golfcoder in me. – LukStorms Jan 11 '22 at 19:59
  • Next version. It is extremely unlikely to be back-ported to older versions. – Aaron Bertrand Jan 11 '22 at 20:07

If you're on SQL Server 2017 and don't need a guarantee that the order will be maintained, then LukStorms' answer is perfectly adequate.

However, if you:

  • care about an order guarantee; or,
  • are on an older version than 2017 (and can't use STRING_AGG); or,
  • are on an even older version than 2016 or are in an older compatibility level (and can't use STRING_SPLIT):

Here's an ordered split function that can help (it's long and ugly but you only have to create it once):

CREATE FUNCTION dbo.SplitOrdered
(
    @list    nvarchar(max), 
    @delim   nvarchar(10)
)
RETURNS TABLE 
WITH SCHEMABINDING 
AS 
RETURN
(
  WITH w(n) AS (SELECT 0 FROM (VALUES (0),(0),(0),(0)) w(n)),
       k(n) AS (SELECT 0 FROM w a, w b),
       r(n) AS (SELECT 0 FROM k a, k b, k c, k d, k e, k f, k g, k h),
       p(n) AS (SELECT TOP (COALESCE(LEN(@list), 0)) 
                ROW_NUMBER() OVER (ORDER BY @@SPID) -1 FROM r),
       spots(p) AS 
       (
         SELECT n FROM p 
         WHERE (SUBSTRING(@list, n, LEN(@delim + 'x') - 1) LIKE @delim OR n = 0)
       ),
       parts(p,val) AS 
       (
         SELECT p, SUBSTRING(@list, p + LEN(@delim + 'x') - 1, 
           LEAD(p, 1, 2147483647) OVER (ORDER BY p) - p - LEN(@delim)) 
         FROM spots AS s
       )
       SELECT listpos = ROW_NUMBER() OVER (ORDER BY p), 
              Item    = LTRIM(RTRIM(val))
         FROM parts
);
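The positional idea behind this function can be sketched in Python (my addition, under the assumption that the sketch need only mirror the logic, not the tally-table machinery; `split_ordered` is an invented name): find every delimiter position plus a virtual one before the start, cut the string between consecutive positions, and number the pieces so order is guaranteed.

```python
# Mirrors the spots/parts CTEs: collect delimiter positions, then take
# the substring between each pair of consecutive positions, trimming
# whitespace and numbering the results like listpos.
def split_ordered(lst: str, delim: str):
    d = len(delim)
    spots = [-d] + [i for i in range(len(lst)) if lst[i:i + d] == delim]
    spots.append(len(lst))
    return [(n + 1, lst[spots[n] + d:spots[n + 1]].strip())
            for n in range(len(spots) - 1)]

print(split_ordered("0001, A;0002, B", ";"))
```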

Then the query can become:

;WITH x AS 
(
  SELECT id, listpos, 
    codes = LEFT(Item, COALESCE(NULLIF(CHARINDEX(',', Item),0),1)-1)
  FROM dbo.your_table
  CROSS APPLY dbo.SplitOrdered(fieldname, ';') AS c
)
SELECT id, codes = (
  (SELECT x2.codes + ' '
    FROM x AS x2
     WHERE x2.id = x.id
     ORDER BY x2.listpos
     FOR XML PATH(''), TYPE).value(N'./text()[1]', N'nvarchar(max)')
)
FROM x GROUP BY id;

Note that, in addition to guaranteeing order and being backward compatible (well, only back so many versions), it also ignores garbage data, e.g. try:

0001, ABCD1234;0002 but no comma
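To see why the garbage is ignored (my addition, not part of the answer): in the query above, a piece without a comma makes `CHARINDEX` return 0, so the `COALESCE(NULLIF(...),1)-1` expression yields 0 and `LEFT` takes nothing. A quick Python sketch of that behavior, with an invented function name:

```python
# Pieces without a comma produce an empty code (LEFT(Item, 0) in the
# T-SQL), which contributes nothing to the concatenated result.
def codes_ignoring_garbage(fieldname: str) -> list:
    out = []
    for piece in fieldname.split(';'):
        pos = piece.find(',')
        code = piece[:pos] if pos != -1 else ''  # no comma -> empty string
        if code:
            out.append(code.strip())
    return out

print(codes_ignoring_garbage("0001, ABCD1234;0002 but no comma"))
# -> ['0001']
```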
Aaron Bertrand
  • Ok, now I'm starting to wonder which of the UDF methods to replace string_split is the fastest. ;) The use of `@@SPID` in this one is... surprising. – LukStorms Jan 11 '22 at 20:34
  • @LukStorms Ultimately, it doesn't matter, because we should all be striving to not have SQL Server split strings in the first place (e.g. use TVPs) or, failing that, to be on a version modern enough to use native methods instead of UDFs or CLR. But I have done some perf comparisons: https://sqlblog.org/split and some background on `@@SPID`/`@@TRANCOUNT` [here](https://sqlperformance.com/2021/04/t-sql-queries/number-series-solutions-4). – Aaron Bertrand Jan 11 '22 at 20:38
  • Ok, wow. So basically CLR wins overall (as expected). And recursive CTE is comparable with XML method. Thanks. – LukStorms Jan 11 '22 at 20:49
  • @LukStorms err, I don't know that "CLR wins" is what I would have taken from any of those posts, but ok. :-) If you're stuck on an ancient version and you have the flexibility to use CLR and you're prepared to ditch it into the pram when you move to cloud, then yeah, maybe CLR "wins." I've been fighting against CLR for ages, even though it did have the best performance in a couple of individual use cases before a native implementation was available. – Aaron Bertrand Jan 11 '22 at 20:54
  • @LukStorms The article where I declared CLR the winner was originally written in 2012. I didn't redo the performance tests when I updated the article a few months ago, but I should have. STRING_SPLIT is a much better alternative for simplicity, performance, and portability / forward compatibility. If you really want to optimize the performance of splitting strings in SQL Server, though, don't split strings in SQL Server at all: [use table-valued parameters](https://sqlperformance.com/2012/08/t-sql-queries/splitting-strings-now-with-less-t-sql). – Aaron Bertrand Jan 11 '22 at 21:00