How to replace a pattern substring in a long string with a value from another table in SQL Server 2016

Question

I am running into issue of Replace function in SQL Server 2016 where I can't replace with any pattern. The problem is:

I have one big string inside there are couple of queries which are separated by a pattern of @@ @@ like below:

I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.

Now, the text inside @@ @@ is a query. The answers of this query are saved in another table of database.

I want to replace the queries from the text with the table values. We can run a cursor from the table to get the values and then replace with particular position. Let's say we have values for the queries as below in a table (Answers) where we can get the data in the order of the queries:

What is your name -> Doe
what is your age -> 29
How are you -> I am fine

The final answer will be: I have a question. Doe29I need more answers.I am fine. Answer if you want.

Now, I am finding difficulty on replacing each query between @@ @@ in SQL Server 2016.

Can anyone please help me out on which SQL Server function I would get the functionality of replacing by pattern and replace the query with original values from a table?

Sander · Answer 1 · 2020-07-26T09:09:23.530

A solution with SQL cursors must exist, but is not required for this problem. The solution below uses two recusive common table expressions (recursive CTE) and some string functions. The complete solution may seem daunting, so I have split the approach behind it in three steps.

Sample data

declare @question table
(
    question nvarchar(200) primary key clustered
);

insert into @question (question) values
('I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.');

declare @answers table
(
    key_ nvarchar(30) primary key clustered,
    value_ nvarchar(30)
);

insert into @answers (key_, value_) values
('What is your name', 'Doe'),
('what is your age', '29'),
('How are you', 'I am fine');

Complete solution

with cte_position as
(
    select  charindex('@@', q.question, 0) as 'position',
            1 as 'count_',
            convert(nvarchar(30), null) as 'key_',
            q.question
    from @question q
        union all
    select  charindex('@@', cp.question, cp.position+2),
            cp.count_+1,
            convert(nvarchar(30), case
                when (cp.count_+1) % 2 = 0
                then substring(cp.question, cp.position+2, charindex('@@', cp.question, cp.position+2)-cp.position-2)
                else null
            end),
            cp.question
    from cte_position cp
    where charindex('@@', cp.question, cp.position+2) > 0
),
cte_replace as
(
    select  cp.count_,
            cp.key_,
            replace(cp.question, '@@'+cp.key_+'@@', a.value_) as 'newquestion'
    from cte_position cp
    join @answers a
        on a.key_ = cp.key_
    where cp.key_ is not null
      and not exists (  select top 1 'x'
                        from cte_position cp2
                        where cp2.key_ is not null
                          and cp2.count_ < cp.count_ )
        union all
    select  cp.count_,
            cp.key_,
            replace(cr.newquestion, '@@'+cp.key_+'@@', a.value_)
    from cte_replace cr
    join cte_position cp
        on cp.count_ = cr.count_ + 2
    join @answers a
        on a.key_ = cp.key_
)
select top 1 cr.newquestion
from cte_replace cr
order by cr.count_ desc;

Part 1 - Extract keys to replace

Structure of the recusive CTE cte_position:

Take the single question record.
Look for the first occurence of @@, so starting from position 0.
Start a counter at 1.
(Ignore key_ for now, set to null)
Repeat the question value.

For the recursion part:

Take the previously record (steps 2. through 5. above).
Look for the next occurence of @@, so starting past the previous position = cp.position+2. The +2 accounts for the separator @@ being 2 characters wide.
Increment the counter with +1.
When the counter has reached an even number, then extract the text between the previous and current occurence of @@ as a key_ to replace. For odd counts there is no key_ value, set null.
Repeat the question value.
Keep doing this as long as a next occurence of @@ can be found.

Intermediate result

position    count_      key_                           question
----------- ----------- ------------------------------ ---------------------------------------------------------------------------------------------------------------------
20          1           NULL                           I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.
39          2           What is your name              I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.
41          3           NULL                           I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.
59          4           what is your age               I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.
81          5           NULL                           I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.
94          6           How are you                    I have a question. @@What is your name@@@@what is your age@@I need more answers.@@How are you@@.Answer if you want.

Part 2 - Replace keys with values

Structure of the recusive CTE cte_replace:

Select only intermediate records with a key to replace = cp.key_ is not null.
Take the first record, being the one with the lowest count (so a record with a lower count_ must not exist).
Select the counter.
Select the key to replace.
Join with @answer to perform the actual replace and rename the result to newquestion.

For the recursion part:

Take the previously record (steps 1. through 5. above).
Get the next key by joining with cte_positions and filtering on the record with a count_ that is 2 bigger (we selected only even rows in step 1.).
Join with @answer to perform the actual replacement action.
The recursion stops when there are no more key_'s found in @answer to replace.

Intermediate result

count_      key_                           newquestion
----------- ------------------------------ --------------------------------------------------------------------------------------------------
2           What is your name              I have a question. Doe@@what is your age@@I need more answers.@@How are you@@.Answer if you want.
4           what is your age               I have a question. Doe29I need more answers.@@How are you@@.Answer if you want.
6           How are you                    I have a question. Doe29I need more answers.I am fine.Answer if you want.

Part 3 - Select final result

The easiest part: select the record from the last result set with the highest count, so the first (top 1) record when sorting on count descending (cr.count_ desc).

Result

newquestion
-----------------------------------------------------------------------------
I have a question. Doe29I need more answers.I am fine.Answer if you want.

Well prepared answer, well done. Recursion, however, will be slow and is a bit complex for this type of thing. — Alan Burstein, Jul 26 '20 at 00:37
Performance all depends on the size and complexity of the data set combined with the available system resources. Hard to validate without more sample data. I did not consider predefined functions for my solution. — Sander, Jul 26 '20 at 08:48
@AlanBurstein Did some quick testing with the single row of sample data. If I define my table variables like you, with `primary key`'s (solution updated), then my query plan has a cost (0,0547064) of about double of your `delimitedSplit8K` solution (0,0253554). So your solution would be the preferred approach if functions are an option. — Sander, Jul 26 '20 at 09:07
To be clear - I like you solution and I like recursiveCTEs for some things. I added a performance test harness above. I would have to turn your logic into a function which became a little more time consuming than planned. I suspect the rCTE solution will be slower - it will definitely generate more IO. — Alan Burstein, Jul 26 '20 at 14:49

Alan Burstein · Answer 2 · 2020-07-26T14:45:47.730

Updated 20200725

No cursors, loops OR recursive CTEs required. A cursor or recursive solution will almost certainly be slow and overly bulky. A set-based solution, leveraging DelimitedSplit8K or using patextract8K is the way to go.

Note the updated sample data (which includes more @'s):

-- 1. Sample Data
declare @question table
(
  qID      int  PRIMARY KEY CLUSTERED,
  question nvarchar(200) NOT NULL
);
declare @answers table
(
  aID    int          NOT NULL INDEX nc__answers__aID,
  key_   nvarchar(30) NOT NULL,
  value_ nvarchar(30) NOT NULL
);
insert into @question (qID,question) values
(1,'I have a question. @@What is your name@@@@what is your age@@I need more answers.'+
   '@@@@@@@@@@@@@@@@@@@How are you@@@@@@@@@@@@@@@@@@@@@@@@@@@.Answer if you want.');
insert into @answers (aID,key_, value_) values
(1,'What is your name','Doe'),(1,'what is your age','29'),(1,'How are you','I am fine');

-- 2. delimitedSplit8K example
SELECT FinalAnswer = 
(
  SELECT      ISNULL(a.value_,split.item)
  FROM        @question                            AS q
  CROSS APPLY dbo.delimitedSplit8K(q.question,'@') AS split
  LEFT JOIN   @answers                             AS a
    ON        a.aID = q.qID
    AND       a.key_ = split.item
  ORDER BY    split.ItemNumber
  FOR XML PATH('')
);

-- 3. Patextract8K example 
SELECT FinalAnswer = 
(
  SELECT      ISNULL(a.value_,split.item)
  FROM        @question                         AS q
  CROSS APPLY samd.patExtract8K(q.question,'@') AS split
  LEFT JOIN   @answers                          AS a
    ON        a.aID = q.qID
    AND       a.key_ = split.item
  ORDER BY    split.ItemNumber
  FOR XML PATH('')
);

Results for each solution:

FinalAnswer
--------------------------------------------------------------------------
I have a question. Doe29I need more answers.I am fine.Answer if you want.

The benefit of the extractor method over the splitter method is that the extractor method "skips" the delimiters (@). Compare the execution plans:

DelimitedSplit8K Execution plan

patExtract8K Execution plan

I added some extra delimiters in the sample data to show how, using the extractor it extracts the 6 required items and begins processing only six rows moving forward (instead of 55).

Note that, in this case, we are dealing with only one record. To better understand how to do this with multiple rows you would need to have better sample data.

Final Edit 20200726 (Performance Test):

Based on a discussion in this thread. I included two different splitters which could be used for this and tested them against a bunch of rows leveraging both a serial and parallel execution plan.

Sample Data

-- 1. Sample Data
SET NOCOUNT ON;
IF OBJECT_ID('dbo.question','U') IS NOT NULL DROP TABLE dbo.question;
IF OBJECT_ID('dbo.answers','U')  IS NOT NULL DROP TABLE dbo.answers;
CREATE TABLE dbo.question
(
  qID      int  PRIMARY KEY CLUSTERED,
  question nvarchar(200) NOT NULL
);
CREATE TABLE dbo.answers
(
  aID    int          NOT NULL INDEX nc__answers__aID,
  key_   nvarchar(30) NOT NULL,
  value_ nvarchar(30) NOT NULL
);
insert into dbo.question (qID,question) values
(1,'I have a question. @@What is your name@@@@what is your age@@I need more answers.'+
   '@@@@@@@@@@@@@@@@@@@How are you@@@@@@@@@@@@@@@@@@@@@@@@@@@.Answer if you want.'),
(2,'I have a question too.@@@@@@@@@@@@Your Name is?@@@how young are you?@@Answer This-'+
   '@@@@@@@@@@@You good?@@@@@@@@@@@@@@@How is Bill@@@@@@@@Answer if you must!.');
insert into dbo.answers (aID,key_, value_) values
(1,'What is your name','Doe'),(1,'what is your age','29'),(1,'How are you','I am fine'),
(2,'Your Name is?','Fred'),(2,'how young are you?','38'),(2,'Answer This-','What?'),
(2,'You good?','Yeah!'),(2,'How is Bill','Sleepy');

-- 2. Add Many New rows for performance testing
DECLARE @multiplier BIGINT = 10000;

INSERT dbo.question
SELECT     ROW_NUMBER() OVER (ORDER BY (SELECT NULL))+2, q.question
FROM       dbo.question                    AS q
CROSS JOIN core.rangeAB(1,@multiplier,1,1) AS r;

INSERT     dbo.answers
SELECT     a.aID%2+((a.aID+r.RN)*2), a.key_, a.value_
FROM       dbo.answers                     AS a
CROSS JOIN core.rangeAB(1,@multiplier,1,0) AS r;

Performance Test

-- 3. Performance Test
BEGIN
  PRINT CHAR(10)+'Start Test'+CHAR(10)+REPLICATE('-',90);
    IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1;
    IF OBJECT_ID('tempdb..#t2') IS NOT NULL DROP TABLE #t2;
    IF OBJECT_ID('tempdb..#t3') IS NOT NULL DROP TABLE #t3;
    IF OBJECT_ID('tempdb..#t4') IS NOT NULL DROP TABLE #t4;
  
  PRINT CHAR(10)+'delimitedSplit8K serial'+CHAR(10)+REPLICATE('-',90);
  
  WITH q2 AS (SELECT q.qID FROM dbo.question AS q GROUP BY q.qID)
  SELECT
    q2.qID,
    FinalAnswer = fa.FinalAnswer
  INTO   #t1
  FROM   q2
  CROSS APPLY
  (
    SELECT      ISNULL(a.value_,split.item)
    FROM        (SELECT q.* FROM dbo.question AS q WHERE q.qID = q2.qID) AS q
    CROSS APPLY dbo.delimitedSplit8K(q.question,'@')                     AS split
    LEFT JOIN   (SELECT a.* FROM dbo.answers  AS a WHERE a.aID = q2.qID) AS a
      ON        a.aID  = q.qID
      AND       a.key_ = split.item
    ORDER BY    split.ItemNumber
    FOR XML PATH('')
  ) AS fa(FinalAnswer)
  OPTION (MAXDOP 1);
  
  PRINT CHAR(10)+'delimitedSplit8K parallel'+CHAR(10)+REPLICATE('-',90);
  
  WITH q2 AS (SELECT q.qID FROM dbo.question AS q GROUP BY q.qID)
  SELECT
    q2.qID,
    FinalAnswer = fa.FinalAnswer
  INTO   #t2
  FROM   q2
  CROSS APPLY
  (
    SELECT      ISNULL(a.value_,split.item)
    FROM        (SELECT q.* FROM dbo.question AS q WHERE q.qID = q2.qID) AS q
    CROSS APPLY dbo.delimitedSplit8K(q.question,'@')                     AS split
    LEFT JOIN   (SELECT a.* FROM dbo.answers  AS a WHERE a.aID = q2.qID) AS a
      ON        a.aID  = q.qID
      AND       a.key_ = split.item
    ORDER BY    split.ItemNumber
    FOR XML PATH('')
  ) AS fa(FinalAnswer)
  OPTION (QUERYTRACEON 8649);
  
  PRINT CHAR(10)+'patExtract8K Serial'+CHAR(10)+REPLICATE('-',90);
  
  WITH q2 AS (SELECT q.qID FROM dbo.question AS q GROUP BY q.qID)
  SELECT
    q2.qID,
    FinalAnswer = fa.FinalAnswer
  INTO   #t3
  FROM   q2
  CROSS APPLY
  (
    SELECT      ISNULL(a.value_,split.item)
    FROM        (SELECT q.* FROM dbo.question AS q WHERE q.qID = q2.qID) AS q
    CROSS APPLY samd.patExtract8K(q.question,'@')                        AS split
    LEFT JOIN   (SELECT a.* FROM dbo.answers  AS a WHERE a.aID = q2.qID) AS a
      ON        a.aID = q.qID
      AND       a.key_ = split.item
    ORDER BY    split.ItemNumber
    FOR XML PATH('')
  ) AS fa(FinalAnswer)
  OPTION (MAXDOP 1);
  
  PRINT CHAR(10)+'patExtract8K Parallel'+CHAR(10)+REPLICATE('-',90);
  WITH q2 AS (SELECT q.qID FROM dbo.question AS q GROUP BY q.qID)
  SELECT
    q2.qID,
    FinalAnswer = fa.FinalAnswer
  INTO   #t4
  FROM   q2
  CROSS APPLY
  (
    SELECT      ISNULL(a.value_,split.item)
    FROM        (SELECT q.* FROM dbo.question AS q WHERE q.qID = q2.qID) AS q
    CROSS APPLY samd.patExtract8K(q.question,'@')                        AS split
    LEFT JOIN   (SELECT a.* FROM dbo.answers  AS a WHERE a.aID = q2.qID) AS a
      ON        a.aID = q.qID
      AND       a.key_ = split.item
    ORDER BY    split.ItemNumber
    FOR XML PATH('')
  ) AS fa(FinalAnswer)
  OPTION (QUERYTRACEON 8649);
END

Results:

delimitedSplit8K serial
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 7641 ms,  elapsed time = 7643 ms.

delimitedSplit8K parallel
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 12454 ms,  elapsed time = 2118 ms.

patExtract8K Serial
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 4172 ms,  elapsed time = 4216 ms.

patExtract8K Parallel
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 7453 ms,  elapsed time = 1343 ms.

For this, a set-based solution. Patextract works here because of the extra long delimiters. I tried the Recursive CTE solution below but could not get it to work; in part, because the logic would need to live in a function which is more work than I have time for. Nonetheless, rCTEs are slow for counting and for splitting strings (in most cases.)

How to replace a pattern substring in a long string with a value from another table in SQL Server 2016

2 Answers2