0

I have a table in ClickHouse with events of connecting and disconnecting to system. Query select timestamp, username, event from table gives following result.

timestamp username event
December 20, 2022, 18:24 1 Connect
December 20, 2022, 18:30 1 Disconnect
December 20, 2022, 18:34 1 Connect
December 21, 2022, 12:07 1 Disconnect
December 20, 2022, 12:15 2 Connect
December 20, 2022, 12:47 2 Disconnect

The session must be show in table as finished by the end of the day. If user was connected to system on 20th December and had no "Disconnect" after that in the same day, I have to add such 'Disconnect' event at the table with some query. And also I have to add row with event of 'Connect' to the table at 00:00 of next day. For example, in sample table you can see that user #1 had not finished session on 20th December, so I want to have the following result:

timestamp username event
December 20, 2022, 18:24 1 Connect
December 20, 2022, 18:30 1 Disconnect
December 20, 2022, 18:34 1 Connect
December 20, 2022, 23:59 1 Disconnect
December 21, 2022, 00:00 1 Connect
December 21, 2022, 12:07 1 Disconnect
December 20, 2022, 12:15 2 Connect
December 20, 2022, 12:47 2 Disconnect

Is there any way to amend the query so it will work as I described above? ClickHouse is not so common as Posrgres or SQL Server, as far as I know, so code in Postgres dialect will be fine, I will find out how to make the same with ClickHouse.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
Daniel G
  • 41
  • 6

2 Answers2

1

First identify these Disconnect events that need to be preceded by a Disconnect/Connect pair. This is t CTE, overnight attribute. Then insert a Disconnect/Connect pair into the_table for every record of t with overnight true.

with t as
(
 select *, 
   "timestamp"::date > lag("timestamp") over (partition by username order by "timestamp")::date overnight 
 from the_table
 where "event" = 'Disconnect'
)
insert into the_table ("timestamp", "username", "event")
 select date_trunc('day', "timestamp") - interval '1 second', "username", 'Disconnect'
   from t where overnight
 union all
 select date_trunc('day', "timestamp"), "username", 'Connect' 
   from t where overnight;

DB-fiddle demo

Stefanov.sm
  • 11,215
  • 2
  • 21
  • 21
  • thank! I'm now trying to do the same in ClickHouse. Is there any chance not to use 'lateral' keyword? Clickhouse does not have this one. – Daniel G Dec 21 '22 at 13:57
  • 1
    I couldn't figure out how to do it w/o `lateral` but maybe [this](https://stackoverflow.com/questions/28550679/what-is-the-difference-between-a-lateral-join-and-a-subquery-in-postgresql) SO thread can shed more light. Or maybe an UDF could help. Or maybe using a procedural language if ClickHouse supports one. – Stefanov.sm Dec 21 '22 at 14:08
  • 1
    It turned out to be much simpler & straightforward w/o `lateral`. Answer updated. – Stefanov.sm Dec 22 '22 at 14:44
1

You do not need lateral join to achieve desired result in Clickhouse (JOINs in Clickhouse are always compute heavy operations since it is a columnar store. ClickHouse takes the right table and creates a hash table for it in RAM).

You can use UNION ALL and ARRAY JOIN in specific way to generate missing rows:

CREATE TABLE connections
(
    `timestamp` DateTime,
    `username` LowCardinality(String),
    `event` enum('Connect', 'Disconnect')
)
ENGINE = Memory;

INSERT INTO connections VALUES
('2022-12-20 18:24:00'::DateTime,   '1',    'Connect')
('2022-12-20 18:30:00'::DateTime,   '1',    'Disconnect')
('2022-12-20 18:34:00'::DateTime,   '1',    'Connect')
('2022-12-21 12:07:00'::DateTime,   '1',    'Disconnect')
('2022-12-20 12:15:00'::DateTime,   '2',    'Connect')
('2022-12-20 12:47:00'::DateTime,   '2',    'Disconnect');

SELECT * FROM 
  (
    SELECT 
      timestamp, username, event 
    FROM 
      connections 
    UNION ALL 
    SELECT 
      timestamp, username, event 
    FROM 
      (
        SELECT 
          [toStartOfDay(timestamp) + INTERVAL '1 DAY' - INTERVAL '1 SECOND', 
          toStartOfDay(timestamp) + INTERVAL '1 DAY' ] timestamps, 
          username, 
          [ 'Disconnect', 'Connect' ] :: Array(Enum('Connect', 'Disconnect')) events 
        FROM 
          connections 
        GROUP BY 
          toStartOfDay(timestamp), username 
        HAVING 
          anyLast(event) = 'Connect'
      ) ARRAY JOIN 
        timestamps AS timestamp, 
        events AS event
  ) 
ORDER BY 
  username, timestamp

Here is the result:

┌───────────timestamp─┬─username─┬─event──────┐
│ 2022-12-20 18:24:00 │ 1        │ Connect    │
│ 2022-12-20 18:30:00 │ 1        │ Disconnect │
│ 2022-12-20 18:34:00 │ 1        │ Connect    │
│ 2022-12-20 23:59:59 │ 1        │ Disconnect │
│ 2022-12-21 00:00:00 │ 1        │ Connect    │
│ 2022-12-21 12:07:00 │ 1        │ Disconnect │
│ 2022-12-20 12:15:00 │ 2        │ Connect    │
│ 2022-12-20 12:47:00 │ 2        │ Disconnect │
└─────────────────────┴──────────┴────────────┘

8 rows in set. Elapsed: 0.011 sec.
JustMe
  • 2,329
  • 3
  • 23
  • 43
  • thanks sir! this works fine and does not demand `insert into` clause, which I can't implement due to read-only mode. One more question - do you know what should I amend in this query to make it work on Postges or MSSQL? I have very similar task in another database. MSSQL does not have such flexibility with arrays, as far as I know. – Daniel G Dec 23 '22 at 08:26
  • I'm now trying rewrite it for Postgres, but ARRAY JOIN is the difficulty, since Postgres does not have this one (I can't find such function in documentation) – Daniel G Dec 23 '22 at 09:05
  • PostgreSQL has `ARRAY JOIN` equivalent called `UNNEST`: https://www.postgresql.org/docs/15/functions-array.html – JustMe Dec 23 '22 at 09:31