
I want to concatenate the surrounding rows (in the following examples, only the surrounding 2 rows) after ranking into a new column (GROUP BY does not seem to work). Here is the data I have:

Schema (MySQL v8.0)

CREATE TABLE log_table (
  `user_id` VARCHAR(5),
  `date_time` DATETIME,
  `event_name` VARCHAR(10),
  `trivial` int
);

INSERT INTO log_table
  (`user_id`, `date_time`, `event_name`, `trivial`)
VALUES
  ('001', '2020-12-10 10:00:02', 'c', 3),
  ('001', '2020-12-10 10:00:01', 'b', 9),
  ('001', '2020-12-10 10:00:40', 'e', 2),
  ('001', '2020-12-10 10:00:20', 'd', 6),
  ('001', '2020-12-10 10:00:00', 'a', 1),
  ('002', '2020-12-10 10:00:10', 'C', 9),
  ('002', '2020-12-10 10:00:50', 'D', 0),
  ('002', '2020-12-10 10:00:02', 'A', 2),
  ('002', '2020-12-10 10:00:09', 'B', 4);

To illustrate what I want to do: for numerical values I can sum over the surrounding rows using a windowed SUM, as follows:
Query #1

SELECT *,
       SUM(trivial) OVER (
         PARTITION BY user_id
         ORDER BY user_id, date_time
         ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS trivial_new
FROM   log_table;
user_id  date_time            event_name  trivial  trivial_new
001      2020-12-10 10:00:00  a           1        13
001      2020-12-10 10:00:01  b           9        19
001      2020-12-10 10:00:02  c           3        21
001      2020-12-10 10:00:20  d           6        20
001      2020-12-10 10:00:40  e           2        11
002      2020-12-10 10:00:02  A           2        15
002      2020-12-10 10:00:09  B           4        15
002      2020-12-10 10:00:10  C           9        15
002      2020-12-10 10:00:50  D           0        13

View on DB Fiddle
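For readers without a MySQL instance handy, the windowed SUM is easy to reproduce with Python's bundled sqlite3 module (SQLite has supported window functions since 3.25, which ships with Python 3.8+). This is a sketch against an in-memory copy of the sample table, not the original fiddle:

```python
import sqlite3

# In-memory copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log_table (user_id TEXT, date_time TEXT, event_name TEXT, trivial INT);
INSERT INTO log_table VALUES
  ('001', '2020-12-10 10:00:02', 'c', 3),
  ('001', '2020-12-10 10:00:01', 'b', 9),
  ('001', '2020-12-10 10:00:40', 'e', 2),
  ('001', '2020-12-10 10:00:20', 'd', 6),
  ('001', '2020-12-10 10:00:00', 'a', 1),
  ('002', '2020-12-10 10:00:10', 'C', 9),
  ('002', '2020-12-10 10:00:50', 'D', 0),
  ('002', '2020-12-10 10:00:02', 'A', 2),
  ('002', '2020-12-10 10:00:09', 'B', 4);
""")

# The same windowed SUM as Query #1 (the window-function syntax is identical here).
rows = conn.execute("""
    SELECT user_id, event_name, trivial,
           SUM(trivial) OVER (
             PARTITION BY user_id
             ORDER BY date_time
             ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS trivial_new
    FROM log_table
    ORDER BY user_id, date_time
""").fetchall()

for row in rows:
    print(row)  # e.g. ('001', 'a', 1, 13)
```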

For the string field event_name, I tried this snippet:
Query #2

SELECT *,
       CONCAT(event_name) OVER (
         PARTITION BY user_id
         ORDER BY user_id, date_time
         ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS event_name_new
FROM   log_table;

And here are my expected results:

user_id  date_time            event_name  trivial  event_name_new
001      2020-12-10 10:00:00  a           1        abc
001      2020-12-10 10:00:01  b           9        abcd
001      2020-12-10 10:00:02  c           3        abcde
001      2020-12-10 10:00:20  d           6        bcde
001      2020-12-10 10:00:40  e           2        cde
002      2020-12-10 10:00:02  A           2        ABC
002      2020-12-10 10:00:09  B           4        ABCD
002      2020-12-10 10:00:10  C           9        ABCD
002      2020-12-10 10:00:50  D           0        BCD

But Query #2 cannot get me there (CONCAT is an ordinary string function, not a window function), and all I could find by googling is about GROUP BY (refer to this and this and this).

I know I can work around the problem with LAG and LEAD (for the following rows), but then I have to concatenate the new columns myself, and when many surrounding rows are involved that means a lot of manual work joining them with separators such as `,`.

Can I do that in one step without using LAG and LEAD?
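(For what it's worth, the "one step" form does exist in some other engines: SQLite, for example, accepts its built-in aggregates, including group_concat, as window functions since 3.25. A hedged sketch using Python's bundled sqlite3 module, assuming a reasonably recent SQLite; note this is SQLite syntax, not MySQL:)

```python
import sqlite3

# In-memory copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log_table (user_id TEXT, date_time TEXT, event_name TEXT, trivial INT);
INSERT INTO log_table VALUES
  ('001', '2020-12-10 10:00:02', 'c', 3),
  ('001', '2020-12-10 10:00:01', 'b', 9),
  ('001', '2020-12-10 10:00:40', 'e', 2),
  ('001', '2020-12-10 10:00:20', 'd', 6),
  ('001', '2020-12-10 10:00:00', 'a', 1),
  ('002', '2020-12-10 10:00:10', 'C', 9),
  ('002', '2020-12-10 10:00:50', 'D', 0),
  ('002', '2020-12-10 10:00:02', 'A', 2),
  ('002', '2020-12-10 10:00:09', 'B', 4);
""")

# SQLite allows group_concat (with a constant '' separator) as a window
# function, so the string case mirrors the SUM query directly.
rows = conn.execute("""
    SELECT user_id, event_name,
           group_concat(event_name, '') OVER (
             PARTITION BY user_id
             ORDER BY date_time
             ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS event_name_new
    FROM log_table
    ORDER BY user_id, date_time
""").fetchall()

for row in rows:
    print(row)
```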

Lerner Zhang
  • With the sample data and expected output you have provided, it's not clear to me what's wrong with using `LAG`? – Nick Jan 23 '21 at 04:50
  • @Nick If `SUM` can do that without `LAG`, I wonder whether the same is possible for concatenating strings? – Lerner Zhang Jan 23 '21 at 05:41
  • @Nick For instance, I want to concatenate the surrounding 5 rows (5 rows preceding and 5 following that row) into the new column; how can I do that neatly? – Lerner Zhang Jan 23 '21 at 05:44
  • 1
    Unfortunately there is no way to specify a window function for string concatenation... Can you edit your question with some more complicated expected output data? (e.g. 2 before, 1 after), that will make it easier to think up a solution – Nick Jan 23 '21 at 05:52
  • @Nick Please check the updated question. – Lerner Zhang Jan 23 '21 at 06:01

2 Answers


You can solve this problem with CTEs: first compute a row number for each user_id, ordered by date_time; then JOIN the table to itself on the row number falling within the window for each row (from row number - before to row number + after). Then you can simply GROUP_CONCAT the event_name values from the joined table:

SET @before := 2;
SET @after := 2;

WITH rns AS (
  -- number the rows within each user, ordered by time
  SELECT *, 
         CAST(ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date_time) AS SIGNED) AS rn
  FROM log_table
)
SELECT r1.user_id, r1.date_time, r1.event_name,
       -- concatenate every neighbour whose row number falls inside the window;
       -- the ORDER BY inside GROUP_CONCAT keeps the concatenation order deterministic
       GROUP_CONCAT(r2.event_name ORDER BY r2.rn SEPARATOR '') AS event_name_new
FROM rns r1
JOIN rns r2 ON r2.user_id = r1.user_id
           AND r2.rn BETWEEN r1.rn - @before AND r1.rn + @after
GROUP BY r1.user_id, r1.date_time, r1.event_name
ORDER BY r1.user_id, r1.rn;

Output (for 2 before and 2 after):

user_id     date_time               event_name  event_name_new
001         2020-12-10 10:00:00     a           abc
001         2020-12-10 10:00:01     b           abcd
001         2020-12-10 10:00:02     c           abcde
001         2020-12-10 10:00:20     d           bcde
001         2020-12-10 10:00:40     e           cde
002         2020-12-10 10:00:02     A           ABC
002         2020-12-10 10:00:09     B           ABCD
002         2020-12-10 10:00:10     C           ABCD
002         2020-12-10 10:00:50     D           BCD

Demo on db-fiddle
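The join condition is really just slice selection by row number: for each row at (1-based) position i within its partition, it collects the rows from i - before through i + after and concatenates them. The same logic, sketched in plain Python for one partition (the event names in date_time order):

```python
def window_concat(events, before=2, after=2):
    """Concatenate each element with its `before` preceding and `after`
    following neighbours, mirroring the rn BETWEEN join condition."""
    return [
        "".join(events[max(0, i - before): i + after + 1])
        for i in range(len(events))
    ]

print(window_concat(list("abcde")))  # ['abc', 'abcd', 'abcde', 'bcde', 'cde']
print(window_concat(list("ABCD")))   # ['ABC', 'ABCD', 'ABCD', 'BCD']
```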

Nick
  • It's too complicated for me, and let me spend some time digesting it. Could you please do some explanation? For instance, some comments in the code. – Lerner Zhang Jan 23 '21 at 07:01
  • 1
    @LernerZhang take a look at https://www.db-fiddle.com/f/5dKasZPNK7SqjnXvsj7qqX/13, it shows the outputs of the intermediate CTEs as well as what you get from the final query if you take away the grouping. That might help... – Nick Jan 23 '21 at 07:05
  • @LernerZhang I have to go... please leave any questions in comments and I'll answer as soon as I can. – Nick Jan 23 '21 at 07:06
  • 1
    @LernerZhang Having slept on it I realised the query can be simplified. Please see my edit. – Nick Jan 23 '21 at 22:19
  • It seems that the `r1.rn` in the last line and the `COUNT(*) OVER (PARTITION BY user_id) AS numrows` can be omitted. Am I right? – Lerner Zhang Jan 25 '21 at 16:22
  • 1
    @LernerZhang yes, the `COUNT(*)` is no longer required, that was part of the original query which was no longer needed after my edit. I've removed it from the answer. You do need the `r1.rn` in the `ORDER BY` clause, otherwise the rows won't come out ordered by `date_time` – Nick Jan 25 '21 at 21:51
  • I found that CTE with grouping is much slower than LAG and LEAD. – Lerner Zhang Jan 30 '21 at 01:36
  • @LernerZhang that's definitely possible, but using LAG and LEAD doesn't give you the flexibility to pick a number of rows to group over. However if you have a better solution, you can always answer your own question. – Nick Jan 30 '21 at 07:09

A correlated subquery might be the simplest solution:

with l as (
      select l.*,
             cast(row_number() over (partition by user_id order by date_time) as signed) as seqnum
      from log_table l
     )
select l.*,
       (select group_concat(l2.event_name order by l2.date_time separator '')
        from l l2
        where l2.user_id = l.user_id and
              l2.seqnum between l.seqnum - 2 and l.seqnum + 2
       ) as new_event_name
from l;

If the event names are one character, you can eliminate the correlated subquery and use string operations:

with l as (
      select l.*, full_concat,
             cast(row_number() over (partition by user_id order by date_time) as signed) as seqnum
      from log_table l join
           (select user_id,  group_concat(event_name order by date_time separator '') as full_concat
            from log_table l
            group by user_id
           ) ll
           using (user_id)
     )
select l.*, substr(full_concat, greatest(seqnum - 2, 1), least(5, seqnum + 2)) as new_event_name
from l;

Here is a db<>fiddle.
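The substr() arithmetic deserves a second look: the start position greatest(seqnum - 2, 1) clamps to 1 near the beginning of the partition, and the length least(5, seqnum + 2) shrinks the window there so the slice never reaches before the first character. A quick check of the same 1-based arithmetic in Python, with the window sizes written as hypothetical parameters:

```python
def slice_concat(full, seqnum, before=2, after=2):
    # Mirrors: substr(full_concat, greatest(seqnum - 2, 1), least(5, seqnum + 2))
    # with 5 generalised as before + after + 1 and 2 as before/after.
    pos = max(seqnum - before, 1)                   # 1-based start position
    length = min(before + after + 1, seqnum + after)
    return full[pos - 1: pos - 1 + length]

print([slice_concat("abcde", s) for s in range(1, 6)])
# ['abc', 'abcd', 'abcde', 'bcde', 'cde']
```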

Gordon Linoff