0

Hopefully everyone is having a productive lockdown all over the world. This is my second issue I wanted some assistance with today.

What I have is a chat from a telecom company signing up new customers. I have successfully collapsed them into 2x rows per unique_id - a unique chat interaction captured between customer and company agent.

  • I would like to now take each column (text) in each row and separate it out to 5 equal varchar columns.
  • The objective is to splice/chunk a conversation into 5 different stages within this table.
  • I do not have access to delimiters as customers and company staff use delimiting characters themselves so it makes this tricky.

Below I have 2 images with what the data looks like now and what I am looking for.

BEFORE

Current DB image showing text in separate lines

AFTER

What I would like to see when the text column is divided

I have looked at the following articles to try to crack it, but am stuck:

Here is the SQL Fiddle page, but I am running this code in MS SQL Server: http://sqlfiddle.com/#!9/ddd08c

Here is the table creation code:

CREATE TABLE Table1
    (`unique_id` double, `user` varchar(8), `text` varchar(144))
;

INSERT INTO Table1
    (`unique_id`, `user`, `text`)
VALUES
    (50314585222,  'customer', 'This is part 1 of long text. This is part 2 of long text. This is part 3 of long text. This is part 4 of long text. This is part 5 of long text.'),
    (50314585222, 'company', 'This is part 1 of long text This is part 2 of long text This is part 3 of long text This is part 4 of long text This is part 5 of long text'),
    (50319875222,  'customer', 'This is part 1 This is part 2 This is part 3 This is part 4 This is part 5'),
    (50319875222,  'company', 'This is part 1 This is part 2 This is part 3 This is part 4 This is part 5')
;

I have requested an almost similar algorithm in R, in my history. I have been trying to do this in SQL.

treeof
  • 63
  • 8
  • Without any form of delimiter how do you decide where to split the text? – Nick Apr 13 '20 at 08:43
  • I was hoping to divide it by total length of column / 5, and then somehow make sure that entire word makes it through without cutting out in the middle. – treeof Apr 13 '20 at 09:11

1 Answers1

0

I have manage to solve this with the T-SQL statement below:

WITH DataSource AS
(
    SELECT *
          ,'\b.{1,'+CAST(CEILING(LEN([text]) * 1.0 /5) AS VARCHAR(12)) +'}\b' AS [pattern]
    FROM TAble1
), PreparedData AS
(
    SELECT unique_id
          ,[user]
          ,'text' +  CAST(RM.matchID + 1 AS VARCHAR(12)) as [column]
          ,RM.CaptureValue AS [value]
    FROM DataSource T
    CROSS APPLY [dbo].[fn_Utils_RegexMatches] ([text], [pattern]) RM
)
SELECT *
FROM PreparedData DS
PIVOT
(
    max([value]) for [column] IN ([text1], [text2], [text3], [text4], [text5])
) PVT;

enter image description here

In order to use this code, you need to implement SQL CLR function(s) for working with regular expression in the context of T-SQL (you need to invest some time understanding how SQL CLR works) - otherwise, you will not be able to use this solution.

So, having RegexMatches function, the first part is to build a regular expression pattern for splitting the data:

SELECT *
      ,'\b.{1,'+CAST(CEILING(LEN([text]) * 1.0 /5) AS VARCHAR(12)) +'}\b' AS [pattern]
FROM TAble1;

The pattern is \b.number\b and will match part of the strings with length number but not cutting the words (check if boundary works for you, because in some cases it won't).

Then, using our regex matches function we getting a result like this (the second common table expression):

enter image description here

And the data above is ready for pivoting which is pretty easy.

So, the notes are:

  • you need to implement Microsoft String Utility
  • you need to ensure the regex pattern works for you
  • you can split the T-SQL I used, check the other columns of the regex function and even make dynamic pivoting - the code is an example and need to modify/check it before using in production
gotqn
  • 42,737
  • 46
  • 157
  • 243