
I want to concatenate the surrounding rows (in the following examples, only the surrounding 2 rows) after ranking into a new column (GROUP BY does not seem to work). Here is the data I have:

Schema (MySQL v8.0)

CREATE TABLE log_table (
  `user_id` VARCHAR(5),
  `date_time` DATETIME,
  `event_name` VARCHAR(10),
  `trivial` int
);

INSERT INTO log_table
  (`user_id`, `date_time`, `event_name`, `trivial`)
VALUES
  ('001', '2020-12-10 10:00:02', 'c', 3),
  ('001', '2020-12-10 10:00:01', 'b', 9),
  ('001', '2020-12-10 10:00:40', 'e', 2),
  ('001', '2020-12-10 10:00:20', 'd', 6),
  ('001', '2020-12-10 10:00:00', 'a', 1),
  ('002', '2020-12-10 10:00:10', 'C', 9),
  ('002', '2020-12-10 10:00:50', 'D', 0),
  ('002', '2020-12-10 10:00:02', 'A', 2),
  ('002', '2020-12-10 10:00:09', 'B', 4);

To illustrate what I want to do: for numerical values I can sum over the surrounding rows using a windowed SUM, as follows:
Query #1

SELECT *,
       SUM(trivial) OVER (
         PARTITION BY user_id
         ORDER BY user_id, date_time
         ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS trivial_new
FROM   log_table;
user_id  date_time            event_name  trivial  trivial_new
001      2020-12-10 10:00:00  a           1        13
001      2020-12-10 10:00:01  b           9        19
001      2020-12-10 10:00:02  c           3        21
001      2020-12-10 10:00:20  d           6        20
001      2020-12-10 10:00:40  e           2        11
002      2020-12-10 10:00:02  A           2        15
002      2020-12-10 10:00:09  B           4        15
002      2020-12-10 10:00:10  C           9        15
002      2020-12-10 10:00:50  D           0        13

View on DB Fiddle
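For readers without a MySQL instance handy, the windowed SUM is easy to reproduce with Python's bundled sqlite3 module (SQLite has supported window functions since 3.25, which ships with Python 3.8+). This is a sketch against an in-memory copy of the sample table, not the original fiddle:

```python
import sqlite3

# In-memory copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log_table (user_id TEXT, date_time TEXT, event_name TEXT, trivial INT);
INSERT INTO log_table VALUES
  ('001', '2020-12-10 10:00:02', 'c', 3),
  ('001', '2020-12-10 10:00:01', 'b', 9),
  ('001', '2020-12-10 10:00:40', 'e', 2),
  ('001', '2020-12-10 10:00:20', 'd', 6),
  ('001', '2020-12-10 10:00:00', 'a', 1),
  ('002', '2020-12-10 10:00:10', 'C', 9),
  ('002', '2020-12-10 10:00:50', 'D', 0),
  ('002', '2020-12-10 10:00:02', 'A', 2),
  ('002', '2020-12-10 10:00:09', 'B', 4);
""")

# The same windowed SUM as Query #1 (the window-function syntax is identical here).
rows = conn.execute("""
    SELECT user_id, event_name, trivial,
           SUM(trivial) OVER (
             PARTITION BY user_id
             ORDER BY date_time
             ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS trivial_new
    FROM log_table
    ORDER BY user_id, date_time
""").fetchall()

for row in rows:
    print(row)  # e.g. ('001', 'a', 1, 13)
```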

For the string field event_name, I tried this snippet:
Query #2

SELECT *,
       CONCAT(event_name) OVER (
         PARTITION BY user_id
         ORDER BY user_id, date_time
         ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS event_name_new
FROM   log_table;

And here are my expected results:

user_id  date_time            event_name  trivial  event_name_new
001      2020-12-10 10:00:00  a           1        abc
001      2020-12-10 10:00:01  b           9        abcd
001      2020-12-10 10:00:02  c           3        abcde
001      2020-12-10 10:00:20  d           6        bcde
001      2020-12-10 10:00:40  e           2        cde
002      2020-12-10 10:00:02  A           2        ABC
002      2020-12-10 10:00:09  B           4        ABCD
002      2020-12-10 10:00:10  C           9        ABCD
002      2020-12-10 10:00:50  D           0        BCD

But Query #2 cannot get me there (CONCAT is an ordinary string function, not a window function), and all I could find by googling is about GROUP BY (refer to this and this and this).

I know I can work around the problem with LAG and LEAD (for the following rows), but then I have to concatenate the new columns myself, and when many surrounding rows are involved that means a lot of manual work joining them with separators such as `,`.

Can I do that in one step without using LAG and LEAD?
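(For what it's worth, the "one step" form does exist in some other engines: SQLite, for example, accepts its built-in aggregates, including group_concat, as window functions since 3.25. A hedged sketch using Python's bundled sqlite3 module, assuming a reasonably recent SQLite; note this is SQLite syntax, not MySQL:)

```python
import sqlite3

# In-memory copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log_table (user_id TEXT, date_time TEXT, event_name TEXT, trivial INT);
INSERT INTO log_table VALUES
  ('001', '2020-12-10 10:00:02', 'c', 3),
  ('001', '2020-12-10 10:00:01', 'b', 9),
  ('001', '2020-12-10 10:00:40', 'e', 2),
  ('001', '2020-12-10 10:00:20', 'd', 6),
  ('001', '2020-12-10 10:00:00', 'a', 1),
  ('002', '2020-12-10 10:00:10', 'C', 9),
  ('002', '2020-12-10 10:00:50', 'D', 0),
  ('002', '2020-12-10 10:00:02', 'A', 2),
  ('002', '2020-12-10 10:00:09', 'B', 4);
""")

# SQLite allows group_concat (with a constant '' separator) as a window
# function, so the string case mirrors the SUM query directly.
rows = conn.execute("""
    SELECT user_id, event_name,
           group_concat(event_name, '') OVER (
             PARTITION BY user_id
             ORDER BY date_time
             ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS event_name_new
    FROM log_table
    ORDER BY user_id, date_time
""").fetchall()

for row in rows:
    print(row)
```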

Lerner Zhang
  • With the sample data and expected output you have provided, it's not clear to me what's wrong with using `LAG`? – Nick Jan 23 '21 at 04:50
  • @Nick If `SUM` can do that without `LAG`, I wonder whether the same is possible for concatenating strings? – Lerner Zhang Jan 23 '21 at 05:41
  • @Nick For instance, I want to concatenate the surrounding 5 rows (5 rows preceding and 5 following that row) into the new column; how can I do that neatly? – Lerner Zhang Jan 23 '21 at 05:44
  • 1
    Unfortunately there is no way to specify a window function for string concatenation... Can you edit your question with some more complicated expected output data? (e.g. 2 before, 1 after), that will make it easier to think up a solution – Nick Jan 23 '21 at 05:52
  • @Nick Please check the updated question. – Lerner Zhang Jan 23 '21 at 06:01

2 Answers


You can solve this problem with CTEs: first compute a row number for each user_id, ordered by date_time; then JOIN the table to itself on the row number falling within the window for each row (from row number - before to row number + after). Then you can simply GROUP_CONCAT the event_name values from the joined table:

SET @before := 2;
SET @after := 2;

WITH rns AS (
  -- number the rows within each user, ordered by time
  SELECT *, 
         CAST(ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date_time) AS SIGNED) AS rn
  FROM log_table
)
SELECT r1.user_id, r1.date_time, r1.event_name,
       -- concatenate every neighbour whose row number falls inside the window;
       -- the ORDER BY inside GROUP_CONCAT keeps the concatenation order deterministic
       GROUP_CONCAT(r2.event_name ORDER BY r2.rn SEPARATOR '') AS event_name_new
FROM rns r1
JOIN rns r2 ON r2.user_id = r1.user_id
           AND r2.rn BETWEEN r1.rn - @before AND r1.rn + @after
GROUP BY r1.user_id, r1.date_time, r1.event_name
ORDER BY r1.user_id, r1.rn;

Output (for 2 before and 2 after):

user_id     date_time               event_name  event_name_new
001         2020-12-10 10:00:00     a           abc
001         2020-12-10 10:00:01     b           abcd
001         2020-12-10 10:00:02     c           abcde
001         2020-12-10 10:00:20     d           bcde
001         2020-12-10 10:00:40     e           cde
002         2020-12-10 10:00:02     A           ABC
002         2020-12-10 10:00:09     B           ABCD
002         2020-12-10 10:00:10     C           ABCD
002         2020-12-10 10:00:50     D           BCD

Demo on db-fiddle
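The join condition is really just slice selection by row number: for each row at (1-based) position i within its partition, it collects the rows from i - before through i + after and concatenates them. The same logic, sketched in plain Python for one partition (the event names in date_time order):

```python
def window_concat(events, before=2, after=2):
    """Concatenate each element with its `before` preceding and `after`
    following neighbours, mirroring the rn BETWEEN join condition."""
    return [
        "".join(events[max(0, i - before): i + after + 1])
        for i in range(len(events))
    ]

print(window_concat(list("abcde")))  # ['abc', 'abcd', 'abcde', 'bcde', 'cde']
print(window_concat(list("ABCD")))   # ['ABC', 'ABCD', 'ABCD', 'BCD']
```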

Nick
  • It's too complicated for me, and let me spend some time digesting it. Could you please do some explanation? For instance, some comments in the code. – Lerner Zhang Jan 23 '21 at 07:01
  • 1
    @LernerZhang take a look at https://www.db-fiddle.com/f/5dKasZPNK7SqjnXvsj7qqX/13, it shows the outputs of the intermediate CTEs as well as what you get from the final query if you take away the grouping. That might help... – Nick Jan 23 '21 at 07:05
  • @LernerZhang I have to go... please leave any questions in comments and I'll answer as soon as I can. – Nick Jan 23 '21 at 07:06
  • 1
    @LernerZhang Having slept on it I realised the query can be simplified. Please see my edit. – Nick Jan 23 '21 at 22:19
  • It seems that the `r1.rn` in the last line and the `COUNT(*) OVER (PARTITION BY user_id) AS numrows` can be omitted. Am I right? – Lerner Zhang Jan 25 '21 at 16:22
  • 1
    @LernerZhang yes, the `COUNT(*)` is no longer required, that was part of the original query which was no longer needed after my edit. I've removed it from the answer. You do need the `r1.rn` in the `ORDER BY` clause, otherwise the rows won't come out ordered by `date_time` – Nick Jan 25 '21 at 21:51
  • I found that CTE with grouping is much slower than LAG and LEAD. – Lerner Zhang Jan 30 '21 at 01:36
  • @LernerZhang that's definitely possible, but using LAG and LEAD doesn't give you the flexibility to pick a number of rows to group over. However if you have a better solution, you can always answer your own question. – Nick Jan 30 '21 at 07:09

A correlated subquery might be the simplest solution:

with l as (
      select l.*,
             cast(row_number() over (partition by user_id order by date_time) as signed) as seqnum
      from log_table l
     )
select l.*,
       (select group_concat(l2.event_name order by l2.date_time separator '')
        from l l2
        where l2.user_id = l.user_id and
              l2.seqnum between l.seqnum - 2 and l.seqnum + 2
       ) as new_event_name
from l;

If the event names are one character, you can eliminate the correlated subquery and use string operations:

with l as (
      select l.*, full_concat,
             cast(row_number() over (partition by user_id order by date_time) as signed) as seqnum
      from log_table l join
           (select user_id,  group_concat(event_name order by date_time separator '') as full_concat
            from log_table l
            group by user_id
           ) ll
           using (user_id)
     )
select l.*, substr(full_concat, greatest(seqnum - 2, 1), least(5, seqnum + 2)) as new_event_name
from l;

Here is a db<>fiddle.
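The substr() arithmetic deserves a second look: the start position greatest(seqnum - 2, 1) clamps to 1 near the beginning of the partition, and the length least(5, seqnum + 2) shrinks the window there so the slice never reaches before the first character. A quick check of the same 1-based arithmetic in Python, with the window sizes written as hypothetical parameters:

```python
def slice_concat(full, seqnum, before=2, after=2):
    # Mirrors: substr(full_concat, greatest(seqnum - 2, 1), least(5, seqnum + 2))
    # with 5 generalised as before + after + 1 and 2 as before/after.
    pos = max(seqnum - before, 1)                   # 1-based start position
    length = min(before + after + 1, seqnum + after)
    return full[pos - 1: pos - 1 + length]

print([slice_concat("abcde", s) for s in range(1, 6)])
# ['abc', 'abcd', 'abcde', 'bcde', 'cde']
```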

Gordon Linoff