
What I have is a chat log from a telecom company signing up new customers. In each chat, a customer and a company representative exchange messages.

I am trying to collapse the chat so that there are fewer rows. The diagrams below show what the data looks like now and what it needs to look like after collapsing.

BEFORE (dataset right now): [image]

AFTER (what I would like to see): [image]

I have looked at the following articles:

I have tried this code:

select 
    unique_id, string_agg(concat(text, ' ', text), ', ')
from 
    conversation
group by 
    unique_id, user

However, this does not collapse it as necessary: it collapses each conversation entirely into two rows, one for the customer and one for the company. The logic I am looking for is: if the next row in this query contains the same unique_id and user, then concatenate the current row's text field with the next row's text field.

Here is the SQL Fiddle page (note that I am actually running this code on SQL Server, which has STRING_AGG): http://sqlfiddle.com/#!9/5ad86c/3

I have previously asked for a similar algorithm in R; see my StackOverflow history.

CREATE TABLE conversation
(
     `unique_id` double, 
     `line_no` int, 
     `user` varchar(7000), 
     `text` varchar(7000)
);

INSERT INTO conversation (`unique_id`, `line_no`, `user`, `text`)
VALUES
    (50314585222, 1, 'customer', 'Hi I would like to sign up for a service'),
    (50314585222, 2, 'company', 'Hi My name is Alex. We can offer the following plans. We also have signup bonuses, with doubling of data for 12 months '),
    (50314585222, 3, 'company', 'Plan1: 40GB data, with monthly price of $80'),
    (50314585222, 4, 'company', 'Plan2: 20GB data, with monthly price of $40'),
    (50314585222, 5, 'company', 'Plan3: 5GB data, with monthly price of $15'),
    (50314585222, 6, 'customer', 'I was hoping for a much smaller plan, with only voice service'),
    (50314585222, 7, 'customer', 'maybe the $10 per month plan.'),
    (50319875222, 4, 'customer', 'so how do I sign up'),
    (50319875222, 5, 'customer', '*for the service'),
    (50319875222, 7, 'company', 'maybe I can call you for your details?')
;
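
Roughly, after collapsing, the data should look like this, with one row per consecutive run of the same user within a conversation (my reading of the logic above applied to the sample rows; the long company text is abbreviated with [...]):

unique_id   user      text
50314585222 customer  Hi I would like to sign up for a service
50314585222 company   Hi My name is Alex. [...] Plan3: 5GB data, with monthly price of $15
50314585222 customer  I was hoping for a much smaller plan, with only voice service maybe the $10 per month plan.
50319875222 customer  so how do I sign up *for the service
50319875222 company   maybe I can call you for your details?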
treeof
  • You need to have value(s) in your table which logically tell us which rows should be collapsed together. – Tim Biegeleisen Apr 13 '20 at 06:54
  • Agreed, but this is where I am tripping up. The logic is: if the next row in this query contains the same unique_id and user, then concatenate the current row's text field with the next row's text field. – treeof Apr 13 '20 at 06:57
  • No, that isn't your logic, because if it were, then the 10 records you showed us would collapse to just one line. Please tell us the logic used to collapse the lines. – Tim Biegeleisen Apr 13 '20 at 06:59
  • Correct. I don't know what I have to do to get there. – treeof Apr 13 '20 at 07:07
  • You need another column which keeps track of the sub-threads. Then, you would aggregate both by `unique_id` and this other column, to generate the rollups you want to see. – Tim Biegeleisen Apr 13 '20 at 07:12
  • Are you sure you are on a SQL Server RDBMS? The code you put in this post looks like MySQL: 1. backticks to delimit SQL identifiers are specific to MySQL and do not exist in SQL Server or the ISO SQL standard; 2. the double datatype does not exist in SQL Server, but does in MySQL. – SQLpro Apr 13 '20 at 07:38

2 Answers


If I understand you correctly, the following approach is a possible solution. You need to find where the user changes and define the appropriate groups:

Table:

CREATE TABLE [conversation]
(
     [unique_id] bigint, 
     [line_no] int, 
     [user] varchar(7000), 
     [text] varchar(7000)
);

INSERT INTO [conversation] ([unique_id], [line_no], [user], [text])
VALUES
    (50314585222, 1, 'customer', 'Hi I would like to sign up for a service'),
    (50314585222, 2, 'company', 'Hi My name is Alex. We can offer the following plans. We also have signup bonuses, with doubling of data for 12 months '),
    (50314585222, 3, 'company', 'Plan1: 40GB data, with monthly price of $80'),
    (50314585222, 4, 'company', 'Plan2: 20GB data, with monthly price of $40'),
    (50314585222, 5, 'company', 'Plan3: 5GB data, with monthly price of $15'),
    (50314585222, 6, 'customer', 'I was hoping for a much smaller plan, with only voice service'),
    (50314585222, 7, 'customer', 'maybe the $10 per month plan.'),
    (50319875222, 4, 'customer', 'so how do I sign up'),
    (50319875222, 5, 'customer', '*for the service'),
    (50319875222, 7, 'company', 'maybe I can call you for your details?')
;

Statement:

; WITH ChangesCTE AS (
    SELECT 
        *,
        LAG([user]) OVER (PARTITION BY [unique_id] ORDER BY [line_no]) AS prev_user
    FROM [conversation]
), GroupsCTE AS (
    SELECT 
        *,
        SUM(CASE WHEN [user] <> [prev_user] OR [prev_user] IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY [unique_id] ORDER BY [line_no]) AS [group_id]
    FROM ChangesCTE
)
SELECT 
    [unique_id], 
    MIN([line_no]) AS [line_no], 
    MIN([user]) AS [user], 
    STRING_AGG([text], ' ') WITHIN GROUP (ORDER BY [line_no]) AS [text]
FROM GroupsCTE
GROUP BY [unique_id], [group_id]
ORDER BY [unique_id]

Result:

unique_id   line_no user        text
50314585222 1       customer    Hi I would like to sign up for a service
50314585222 2       company     Hi My name is Alex. We can offer the following plans. We also have signup bonuses, with doubling of data for 12 months  Plan1: 40GB data, with monthly price of $80 Plan2: 20GB data, with monthly price of $40 Plan3: 5GB data, with monthly price of $15
50314585222 6       customer    I was hoping for a much smaller plan, with only voice service maybe the $10 per month plan.
50319875222 4       customer    so how do I sign up *for the service
50319875222 7       company     maybe I can call you for your details?
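
To check how the groups are formed, you can keep the two CTEs as they are and swap in a simple final SELECT (a sketch, reusing the CTE names above):

SELECT [unique_id], [line_no], [user], [prev_user], [group_id]
FROM GroupsCTE
ORDER BY [unique_id], [line_no]

For the first conversation this yields:

line_no user     prev_user group_id
1       customer NULL      1
2       company  customer  2
3       company  company   2
4       company  company   2
5       company  company   2
6       customer company   3
7       customer customer  3

Lines 2-5 share group_id = 2 because [user] never changes, so STRING_AGG stitches their text into the single company row shown above.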
Zhorov

This is a gaps-and-islands problem, where you want to group together adjacent rows by the same speaker.

For this to be solvable, you need a column that orders the records. It seems we cannot rely on line_no, since its values repeat across conversations, so I assumed that a suitable ordering column exists and called it ordering_id.

select
    unique_id,
    min(line_no) as line_no,
    [user],
    string_agg([text], ' ') within group (order by ordering_id) as [text]
from (
    select 
        t.*,
        row_number() over(partition by unique_id order by ordering_id) rn1,
        row_number() over(partition by unique_id, [user] order by ordering_id) rn2
    from conversation t
) t
group by unique_id, [user], rn1 - rn2
order by unique_id, min(ordering_id)
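
To see why rn1 - rn2 isolates each run, here are the values for the first conversation (worked out by hand, assuming ordering_id follows line_no):

line_no user     rn1 rn2 rn1 - rn2
1       customer 1   1   0
2       company  2   1   1
3       company  3   2   1
4       company  4   3   1
5       company  5   4   1
6       customer 6   2   4
7       customer 7   3   4

Consecutive rows by the same user keep a constant difference, so grouping by unique_id, [user] and rn1 - rn2 collapses each run into a single row.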
GMB
  • I have created a separate simple unique index as a primary key. The result set is the same as the one from the query in the question: `select unique_id, string_agg(concat(text, ' ', text), ', ') from conversation group by unique_id, user`. It is an interesting answer, but the other one answers it point blank, so I have marked that one as the answer. Many thanks. – treeof Apr 14 '20 at 03:05