Split String into given length but do not split word

Question

I want to split a string into groups of rows having lengths NOT more than 30, but while splitting, the last word should not break, if it breaks, move that word to the next row set.

Following is the sample string:

DECLARE @FullAddress VARCHAR(MAX)= '216 Apartment123 AreaArea SampleWord1 Word2 MiddleTown Upper1Location Another5 NewYork'

Required Output:

+-----+------------------------------+
| Row |            Result            |
+-----+------------------------------+
|   1 | 216 Apartment123 AreaArea    |
|   2 | SampleWord1 Word2 MiddleTown |
|   3 | Upper1Location Another5      |
|   4 | NewYork                      |
+-----+------------------------------+

I tried with substring, but it breaks the word at a fixed length

DECLARE @FullAddress VARCHAR(MAX)= '216 Apartment123 AreaArea SampleWord1 Word2 MiddleTown UPLocation Downtownocation NewYork'
SELECT SUBSTRING(@FullAddress,1,30)
UNION ALL
SELECT SUBSTRING(@FullAddress,31,30)
UNION ALL
SELECT SUBSTRING(@FullAddress,61,30)

This problem would best be handled outside of SQL Server, which basically has zero regex support. — Tim Biegeleisen, Aug 30 '21 at 11:58
I suspect any solution in T-SQL is going to require recursion; which is unlikely to be performant. — Thom A, Aug 30 '21 at 12:02

score 2 · Accepted Answer · answered Aug 30 '21 at 14:37

Assuming you have a table of addresses, I'd use a recursive CTE.

On each iteration, find the last possible space to break on, then start the next iteration for the character after the space.

take 31 characters
reverse them
find the position of the first space

Extra care to be taken for corner cases:

The remaining string to be searched is less than 30 characters
The current string being searched has no space in the first 31 characters

Using the following test data...

CREATE TABLE test (
  address   VARCHAR(MAX)
);

INSERT INTO
  test
VALUES
  ('216 Apartment123 AreaArea SampleWord1 Word2 MiddleTown Upper1Location Another5 NewYork'),
  ('216 Apartment123 AreaArea SampleWord1 Word2 MiddleTownxx Upper1LocationUpper1LocationUpper1Location Another5 NewYork'),
  ('216 Apartment123 AreaArea SampleWord1 Word2 MiddleTownxx Upper1LocationUpper1LocationUpper1Location Another5 NewYork x')

;

Using the following CTE...

DECLARE @chars BIGINT = 30;

WITH
  parts AS
(
  SELECT
    address,
    LEN(address)        AS length,
    CAST(0 AS BIGINT)   AS last_space,
    CAST(1 AS BIGINT)   AS next,
    address             AS fragment
  FROM
    test

  UNION ALL

  SELECT
    parts.address,
    parts.length,
    last_space.pos,
    parts.next + COALESCE(last_space.pos, @chars),
    SUBSTRING(parts.address, parts.next, COALESCE(last_space.pos - 1, @chars))
  FROM
    parts
  CROSS APPLY
  (
    SELECT
      @chars + 2
      -
      NULLIF(
        CHARINDEX(
          ' ',
          REVERSE(
            SUBSTRING(
              parts.address + ' ',
              parts.next,
              @chars + 1
            )
          )
        )
        , 0
      )
  )
    last_space(pos)
  WHERE
    parts.next <= parts.length
)

SELECT
  *, len(fragment) AS chars
FROM
  parts
WHERE
  next > 1
ORDER BY
  address,
  next

https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=acd11f2bc73e5036bd82498ecf14b08f

Including a version that only calls `REVERSE()` once at the start of the process : https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=bb17deb9003e746e81d0cd12cfd1f8da — MatBailie, Aug 30 '21 at 14:53

Lajos Arpad · Answer 2 · 2021-08-30T13:24:39.707

0

This is a pure SQL Server approach, but I recommend the use of technologies outside SQL Server, because you will have a more efficient approach and a better time while you maintain your code in the future. However, without further ado, let's see this, explaining step-by-step each concept:

First Step

You can create a temporary table which has an id and a content field.

Second Step

You can split your string via STRING_SPLIT, which will return a table for you.

EDIT: As Larnu pointed out in the comment section, unfortunately SPLIT_STRING does not guarantee order of appearance, see Result order of string_split?

An alternative is needed, like OPENJSON.

Third Step

Look into how you can iterate a table.

Fourth Step

Now that you have everything you need, you will need to

create the temporary table
split the string
iterate the result of the split
- load the length of content of the newest record (maximum id) into a variable, default to 0
- if the value you just loaded + the length of the current string + space > maxlength
  - then insert a new record with your value as content
  - else update the last record, by concatenating space and your current string to its previous value

edited Aug 30 '21 at 13:24

answered Aug 30 '21 at 12:22

Lajos Arpad

64,414
37
100
175

3

*Careful* with `STRING_SPLIT`, it makes no guarantee to maintain the order of the values, so it could easily end up putting values in the wrong palce. – Thom A Aug 30 '21 at 12:22
@Larnu you are correct, it's tricky. But not unsolvable. Example: `SELECT value, row_number() OVER (ORDER BY current_timestamp) AS row FROM String_Split('your,string,here',',') d`, see https://dba.stackexchange.com/questions/207274/string-split-and-ordered-results/233480 – Lajos Arpad Aug 30 '21 at 13:16
3

That doesn't actually solve the problem though; you're just ordering the data by an arbitrary value. Any data ordered by an arbitrary value can be returned in *any* arbitrary order. The only way to guarantee the order is with an explicit `ORDER BY` that puts the values into a distinct order, and `STRING_SPLIT` doesn't provide an ordinal position value to be able to do that; something that has been asked for since it was added to the 2016 RC. – Thom A Aug 30 '21 at 13:19

score 0 · Answer 3 · answered Aug 30 '21 at 13:24

You can solve this wuth a recursive CTE and xml path. I split the code into several CTEs for better readability, but I guess you might thinn this out to only the required ones. However, the result should match your expected result:

DECLARE @FullAddress VARCHAR(MAX)= '216 Apartment123 AreaArea SampleWord1 Word2 MiddleTown Upper1Location Another5 NewYork'
DECLARE @MaxLen INT = 30;

WITH cte AS(
SELECT 1 rn, CHARINDEX(' ', @FullAddress) spacePos, LEFT(@FullAddress, CHARINDEX(' ', @FullAddress)) SplitLeft,  RIGHT(@FullAddress, LEN(@FullAddress) - CHARINDEX(' ', @FullAddress)) SplitRight
UNION ALL
SELECT rn+1, CHARINDEX(' ', SplitRight), LEFT(SplitRight, CHARINDEX(' ', SplitRight)), RIGHT(SplitRight, LEN(SplitRight) - CHARINDEX(' ', SplitRight))
  FROM cte
  WHERE CHARINDEX(' ', SplitRight) > 0
UNION ALL
SELECT rn+1, LEN(SplitRight), SplitRight, NULL
  FROM cte
  WHERE CHARINDEX(' ', SplitRight) = 0

),
cte2 AS(
SELECT *, SUM(spacePos) OVER (ORDER BY rn ROWS UNBOUNDED PRECEDING) s, SUM(spacePos) OVER (ORDER BY rn ROWS UNBOUNDED PRECEDING)/@MaxLen s1
  FROM cte
),
cte3 AS(
SELECT s1 + (SUM(spacePos) OVER (PARTITION BY s1 ORDER BY rn ROWS UNBOUNDED PRECEDING)/@MaxLen) s, SplitLeft, rn
  FROM cte2
)
SELECT DISTINCT s, (SELECT c2.SplitLeft AS [text()]
             FROM cte3 c2
             WHERE c2.s = c1.s
             ORDER BY rn
             FOR XML PATH ('')
        ) AS AdressLine
  FROM cte3 c1
  ORDER BY s

Fails on phrases that are exactly 30 characters long, or 'words' that are longer than 30 characters long; https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=2889104e06bdd9cb635f6f54597dc0de — MatBailie, Aug 30 '21 at 15:09

sacse · Answer 4 · 2021-08-30T13:42:32.210

You can maintain the ordering using OPENJOSN() and read the distinct values from temporary table and insert into the final table like below:

declare @FullAddress VARCHAR(MAX)= '216 Apartment123 AreaArea SampleWord1 Word2 MiddleTown Upper1Location Another5 NewYork'

drop table if exists #temp

select convert(int, [key]) [key], [value] strng into #temp
from openjson(concat('["', replace(@FullAddress, ' ', '","'), '"]'))
order by convert(int, [key])

declare @str varchar(100)='', @newstr varchar(30)='', @key int = 0
declare @final_table table (final_strings varchar(30))

while ((select count(1) from #temp) > 0)
begin
    set @newstr = (select top 1 isnull(strng, '') from #temp order by [key])
        
    if len(@str + ' ' + @newstr) > 30 and @str <> ''
    begin
        insert into @final_table select ltrim(@str)
        set @str = @newstr
    end
    else
        set @str = @str + ' ' + @newstr
    
    delete #temp where [key] = @key
    set @key = @key + 1
    
end
if @str <> ''
insert into @final_table select @str

select final_strings, len(final_strings) string_lenght from @final_table

Please see the db<>fiddle here.

This breaks if you ever have a 'word' which is longer than 30 characters; https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=f8b1c582d9b42e098188d7f068d6e2c2 — MatBailie, Aug 30 '21 at 15:07
@MatBailie you are right, but I think OP has no word with a length greater than 30 because the final output column has a max length of 30 with no incomplete word. — sacse, Aug 30 '21 at 15:27
That's an almighty assumption for any production code, especially in the absence of constraints, and when there are are options to cater for the possibility without losing data. — MatBailie, Aug 30 '21 at 15:33

Shiju Shaji · Answer 5 · 2021-08-31T04:58:00.687

Here i am using a SPLIT function (UDF) to split the sting. You can use any split functions available.
SPLIT Function :-

CREATE OR ALTER FUNCTION [dbo].[Split]  
(      
    @RowData NVARCHAR(MAX),  
    @Delimeter NVARCHAR(MAX)  
)  
RETURNS @RtnValue TABLE   
(  
    ID INT IDENTITY(1,1),  
    Data NVARCHAR(MAX)  
)   
AS  
BEGIN   
    Declare @xml xml

    SET @RowData=Replace(Replace(@RowData,'&','&amp;'),'<','&lt;')

    SET @xml = N'<root><r>' + Replace(@RowData, @Delimeter,'</r><r>') + '</r></root>'

    Insert into @RtnValue(Data)
    SELECT  t.value('.', 'nvarchar(max)') AS [Value] FROM @XML.nodes('//root/r') AS a(t)
  
    RETURN  
END

YOUR ANSWER

 DECLARE @FullAddress VARCHAR(MAX) = '216 Apartment123 AreaArea SampleWord1 Word2 MiddleTown Upper1Location Another5 NewYork Zakbrown'
    
SELECT id
    ,Data
FROM SPLIT(@FullAddress, ' ')

DECLARE @table TABLE (
    id INT
    ,datas VARCHAR(50)
    )
DECLARE @data VARCHAR(50)
DECLARE @ID INT
DECLARE @datas VARCHAR(50)
DECLARE @i INT = 1

DECLARE db_cursor CURSOR
FOR
SELECT id
    ,Data
FROM SPLIT(@FullAddress, ' ')

OPEN db_cursor

FETCH NEXT
FROM db_cursor
INTO @id
    ,@data

SET @datas = ''

WHILE @@FETCH_STATUS = 0
BEGIN
    --select @data,Len(@datas)
    IF Len(@datas + ' ' + @data) <= 30
    BEGIN
        SET @datas = @datas + ' ' + @data
    END
    ELSE
    BEGIN
        INSERT INTO @table (
            id
            ,[datas]
            )
        VALUES (
            @i
            ,@datas
            )

        SET @i = @i + 1
        SET @datas = @data
    END

    FETCH NEXT
    FROM db_cursor
    INTO @id
        ,@data
END

CLOSE db_cursor

DEALLOCATE db_cursor

SELECT datas
FROM @table

UNION ALL

SELECT @datas

 OUTPUT
   
  216 Apartment123 AreaArea
  SampleWord1 Word2 MiddleTown
  Upper1Location Another5
  NewYork Zakbrown

This outputs all the words in the wrong order : https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=1f8a98cf505e68eb92700ef63a301340 *(the first line of output **should** be `216 Apartment123 AreaArea` and **not** `216 Another5 Apartment123`)* — MatBailie, Aug 30 '21 at 15:18

Sylver95 · Answer 6 · 2022-06-15T07:57:08.820

I've create a function to do this :

CREATE FUNCTION [dbo].[fnSplitSentence]
(   
   @fullText varchar(max),
   @maxLength int
)
RETURNS @sentences table (
    sentence varchar(max)
)
AS
BEGIN

declare @words table (
    word varchar(max),
    length int
)

SET @fullText = TRIM(@fullText) + ' '

declare @word varchar(max) = '',
        @length int = 0,
        @char varchar(1),
        @i int = 1,
        @len int = len(@fullText) + 1

while @i <= @len
begin 

    set @char = SUBSTRING(@fullText, @i, 1)
    
    if @char = ' '
    begin

        insert into @words (word, length) values (@word, @length)
        set @word = ''
        set @length = 0

    end
    else
    begin

        set @word += @char
        set @length += 1

    end

    set @i += 1

end

declare @sentence varchar(max) = ''

DECLARE CUR CURSOR FOR SELECT word, length FROM @words

open CUR
FETCH NEXT FROM CUR INTO @word, @length

WHILE @@FETCH_STATUS = 0
begin
    
    --print(@word)
    if LEN(@sentence) + LEN(@word) + 1 > @maxLength
    begin
        insert into @sentences (sentence) values (@sentence)
        set @sentence = @word + ' '
    end
    else
        
        set @sentence += @word + ' '

    FETCH NEXT FROM CUR INTO @word, @length

end

insert into @sentences (sentence) values (@sentence)

CLOSE CUR
DEALLOCATE CUR

RETURN;

END

Usage :

select * from fnSplitSentence('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore', 12)

True, but that answers the question perfectly. The original question does not talk about efficiency. This script can be used for procedures called occasionally. But it is true that it should not be used to process a large number of addresses. If you have a better idea I'm very interested — Sylver95, Jun 20 '22 at 10:35
My "better" idea (subjective opinion) is the answer I provided above. — MatBailie, Jun 20 '22 at 10:41

Split String into given length but do not split word

6 Answers6

First Step

Second Step

Third Step

Fourth Step

Linked