1

I am trying (and failing) to join some tables in a SQLite database. The data itself is complicated but I think I have boiled it down to an illustrative example.

Here are the three tables I want to join.

Table: Events

+----+---------+-------+-----------+
| id | user_id | class | timestamp |
+----+---------+-------+-----------+
|  1 | 'user1' |     6 |       100 |
|  2 | 'user1' |    12 |       400 |
|  3 | 'user1' |     4 |       900 |
|  4 | 'user2' |     6 |       400 |
|  5 | 'user2' |     3 |       800 |
|  6 | 'user2' |     8 |       900 |
+----+---------+-------+-----------+

Table: Games

+---------+---------+------------+-----------+
| user_id | game_id | game_class | timestamp |
+---------+---------+------------+-----------+
| 'user1' |       1 | 'A'        |       200 |
| 'user2' |       2 | 'A'        |       300 |
| 'user1' |       3 | 'B'        |       500 |
| 'user1' |       4 | 'A'        |       600 |
| 'user1' |       5 | 'A'        |       700 |
+---------+---------+------------+-----------+

Table: AScores

+---------+-------+
| game_id | score |
+---------+-------+
|       1 |     8 |
|       2 |     2 |
|       4 |     9 |
|       5 |     6 |
+---------+-------+

I would like to join these to provide an additional column on the first table containing the users current score in game class A at the time of the event. I.e. I would like theresult of the join to look like this:

Desired Result

+----+----------+-------+-----------+-----------------+
| id | user_id  | class | timestamp | current_a_score |
+----+----------+-------+-----------+-----------------+
|  1 |  'user1' |     6 |       100 | (null)          |
|  2 |  'user1' |    12 |       400 | 8               |
|  3 |  'user1' |     4 |       900 | 6               |
|  4 |  'user2' |     6 |       400 | 2               |
|  5 |  'user2' |     3 |       800 | 2               |
|  6 |  'user2' |     8 |       900 | 2               |
+----+----------+-------+-----------+-----------------+

The following simple join pulls together the two tables AScores and Games.

SELECT * FROM AScores
INNER JOIN Games
ON AScores.game_id = Games.game_id

And so I was hoping to join this to the Events table as a sub-query. Something like this:

SELECT Events.*, AScoredGames.time_stamp AS game_time_stamp, AScoredGames.score
FROM Events
LEFT OUTER JOIN (
    SELECT AScores.score, Games.* FROM AScores
    INNER JOIN Games
    ON AScores.game_id = Games.game_id
) AS AScoredGames
ON Events.user_id = AScoredGames.user_id 
AND Events.time_stamp >= AScoredGames.time_stamp
ORDER BY Events.time_stamp ASC

That results in the following:

+----+---------+-------+------------+-----------------+-------+
| id | user_id | class | time_stamp | game_time_stamp | score |
+----+---------+-------+------------+-----------------+-------+
|  1 | user1   |     6 | 100        | NULL            | NULL  |
|  2 | user1   |    12 | 400        | 200             | 8     |
|  4 | user2   |     6 | 400        | 300             | 2     |
|  5 | user2   |     3 | 800        | 300             | 2     |
|  6 | user2   |     8 | 900        | 300             | 2     |
|  3 | user1   |     4 | 900        | 200             | 8     |
|  3 | user1   |     4 | 900        | 600             | 9     |
|  3 | user1   |     4 | 900        | 700             | 6     |
+----+---------+-------+------------+-----------------+-------+

So I need to group by Events.id to get rid of the triplicated row with Events.id 3. But what I want to do is to choose the row with the maximum game_time_stamp but then use the row's score. If I do MAX(game_time_stamp) as my aggregation I still have to independently aggregate the score. Is there a way to tie the row choice in the score column's aggregation function to the result of the game_time_stamp column's aggregation function?

(N.B. Existing answers to questions like Select first record in a One-to-Many relation using left join and SQL Server: How to Join to first row seem to suggest I cannot and say one must use a WHERE clause over a sub-query. But I am struggling with that (I'll post another question about that) and I can think of at least one solution and I am hoping there are better ones.)

GMB
  • 216,147
  • 25
  • 84
  • 135
dumbledad
  • 16,305
  • 23
  • 120
  • 273

2 Answers2

1

The following query should do it. It uses a NOT EXISTS condition with a correlated subquery to locate the relevant game record for each event.

SELECT e.*, s.score current_a_score
FROM 
    events e
    LEFT JOIN games g 
        ON  g.user_id = e .user_id
        AND g.timestamp < e.timestamp
        AND NOT EXISTS (
            SELECT 1 
            FROM games g1
            WHERE 
                g1.user_id = e .user_id
                AND g1.timestamp < e.timestamp 
                AND g1.timestamp > g.timestamp
        )
    LEFT JOIN ascores s 
        ON  s.game_id = g.game_id
ORDER BY e.id

This DB Fiddle demo with your test data returns :

| id  | user_id | class | timestamp | current_a_score |
| --- | ------- | ----- | --------- | --------------- |
| 1   | user1   | 6     | 100       |                 |
| 2   | user1   | 12    | 400       | 8               |
| 3   | user1   | 4     | 900       | 6               |
| 4   | user2   | 6     | 400       | 2               |
| 5   | user2   | 3     | 800       | 2               |
| 6   | user2   | 8     | 900       | 2               |
dumbledad
  • 16,305
  • 23
  • 120
  • 273
GMB
  • 216,147
  • 25
  • 84
  • 135
  • That looks great. For my real data it'll be more complicated because the Games table contains lots of rows for games I am not interested in and I only know which I need by joining with the scores table (i.e. there's no 'class' column). But this answer will help. Is `NOT EXISTS` costly on large data-sets? – dumbledad Feb 13 '19 at 20:46
  • @dumbledad : AFAIK `NOT EXISTS` is the most efficient way to do this... If performance matters, you should consider creating an index on `games(user_id, timestamp)`. – GMB Feb 13 '19 at 21:18
0

I have one work-around, but it feels hacky and relies on the specifics of my data. First note that the time_stamps are all multiples of 100 while the scores are all below 10. I can acombine these in a way that will not interfere with my comparison but will mean they are both encoded in one numeric column. This query gives the desired result:

SELECT Events.id, MIN(Events.user_id) AS user_id, MIN(Events.class) AS class, MIN(Events.time_stamp) AS time_stamp, MAX(AScoredGames.combination) % 10 AS current_a_score
FROM Events
LEFT OUTER JOIN (
        SELECT AScores.score, AScores.score + (Games.time_stamp - 10) AS combination, Games.* FROM AScores
        INNER JOIN Games
        ON AScores.game_id = Games.game_id) AS AScoredGames
ON Events.user_id = AScoredGames.user_id AND Events.time_stamp >= AScoredGames.time_stamp
GROUP BY Events.id
ORDER BY id ASC

(The combining is done in AScores.score + (Games.time_stamp - 10) and so the aggregate function becomes MAX(AScoredGames.combination) % 10.)

Actual Result

+----+---------+-------+------------+-----------------+
| id | user_id | class | time_stamp | current_a_score |
+----+---------+-------+------------+-----------------+
|  1 | user1   |     6 |        100 | NULL            |
|  2 | user1   |    12 |        400 | 8               |
|  3 | user1   |     4 |        900 | 6               |
|  4 | user2   |     6 |        400 | 2               |
|  5 | user2   |     3 |        800 | 2               |
|  6 | user2   |     8 |        900 | 2               |
+----+---------+-------+------------+-----------------+
dumbledad
  • 16,305
  • 23
  • 120
  • 273