1

Having the following table (conversations):

 id | record_id  |  is_response  |         text         |
 ---+------------+---------------+----------------------+
 1  |     1      |      false    | in text 1            |
 2  |     1      |      true     | response text 3      |
 3  |     1      |      false    | in text 2            |
 4  |     1      |      true     | response text 2      |
 5  |     1      |      true     | response text 3      |
 6  |     2      |      false    | in text 1            |
 7  |     2      |      true     | response text 1      |
 8  |     2      |      false    | in text 2            |
 9  |     2      |      true     | response text 3      |
 10 |     2      |      true     | response text 4      |

And another help table (responses):

 id |         text         |
 ---+----------------------+
 1  | response text 1      |
 2  | response text 2      |
 3  | response text 4      |

I'm looking for an SQL query to output the following:

  record_id |       context
  ----------+-----------------------+---------------------
       1    | in text 1 response text 3 in text 2 response text 2
  ----------+-----------------------+---------------------
       2    | in text 1 response text 1
  ----------+-----------------------+---------------------
       2    | in text 2 response text 3 response text 4

So each time is_response is true and the text is in the responses table, aggregate the conversation context up to this point, ignoring conversation part that does not end with a response in the pool.

In the example above living response text 3 in record_id 1.

I've tried the following complex SQL but it breaks sometimes aggregating the text wrong:

with context as(
    with answers as (

       SELECT record_id, is_response, id as ans_id
        , max(id)
          OVER (PARTITION BY record_id ORDER BY id
          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS previous_ans_id
       FROM (select * from conversations where text in (select text from responses)) ans
       ),
     lines as (
      select answers.record_id, con.id, COALESCE(previous_ans_id || ',' || ans_id, '0') as block, con.text as text from answers, conversations con where con.engagement_id = answers.record_id and ((previous_ans_id is null and con.id <= ans_id) OR (con.id > previous_ans_id and con.id <= ans_id)) order by engagement_id, id asc
      )

      select record_id, block,replace(trim(both ' ' from string_agg(text, E' ')) ,'  ',' ') ctx from lines group by record_id, block order by record_id,block
      )

select * from context

I'm sure there is a better way.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Shlomi Schwartz
  • 8,693
  • 29
  • 109
  • 186

2 Answers2

1

Here's my take:

SELECT
    record_id,
    string_agg(text, ' ' ORDER BY id) AS context
FROM (
    SELECT
        *,
        coalesce(sum(incl::integer) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),0) AS grp
    FROM (
        SELECT *, is_response AND text IN (SELECT text FROM responses) as incl
        FROM conversations
         ) c
     ) c1
GROUP BY record_id, grp
HAVING bool_or(incl)
ORDER BY max(id);

This will scan the table conversations once, but I am not sure if it will perform better than your solution. The basic idea is to use a window function to count how maybe preceding rows within the same record, end the conversation. Then we can group by with that number and the record_id and discard incomplete conversations.

redneb
  • 21,794
  • 6
  • 42
  • 54
  • Thanks for the reply, unfortunately in your solution the aggregated text does not preserve the ordering of the lines, and I get the text and responses mixed up – Shlomi Schwartz Sep 20 '16 at 07:48
  • @ShlomiSchwartz I added an `ODER BY` in the aggregate, it should be ok now. – redneb Sep 20 '16 at 08:08
  • Works like a charm. Wasn't familiar with ordering within the aggregate function. Thank you. – Shlomi Schwartz Sep 20 '16 at 09:03
  • What if I wanted to change the query, so when incl field is true in sequence (F,F,F,T,T,T) the text will be aggregated in the same group and not as a stand alone row? – Shlomi Schwartz Sep 20 '16 at 10:48
  • @ShlomiSchwartz I am not sure I understand what you mean. Do you want to do the aggregation only in places where incl is true _and_ the record that follows has incl=false? – redneb Sep 20 '16 at 13:42
  • your query works great, except when there more than one valid response in a row. While the desired output is to join them, this query gives 2 groups. In the example above there is the desired output for this scenario (`in text 2 response text 3 response text 4`). – Shlomi Schwartz Sep 21 '16 at 08:06
0

There is a simple and fast solution:

SELECT record_id, string_agg(text, ' ') As context
FROM  (
   SELECT c.*, count(r.text) OVER (PARTITION BY c.record_id ORDER BY c.id DESC) AS grp
   FROM   conversations  c
   LEFT   JOIN responses r ON r.text = c.text AND c.is_response
   ORDER  BY record_id, id
   ) sub
WHERE  grp > 0  -- ignore conversation part that does not end with a response
GROUP  BY record_id, grp
ORDER  BY record_id, grp;

count() only counts non-null values. r.text is NULL if the LEFT JOIN to responses comes up empty:

The value in grp (short for "group") is only increased when a new output row is triggered. All rows belonging to the same output row end up with the same grp number. It's then easy to aggregate in the outer SELECT.

The special trick is to count conversation ends in reverse order. Everything after the last end (coming first when starting from the end) gets grp = 0 and is removed in the outer SELECT.

Similar cases with more explanation:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228