The question in short
What is an efficient, scalable way of selecting two (or more) rows from a table with consecutive IDs, especially if this table is joined with another table?
Related questions have been asked before on Stack Overflow, e.g.:
The answers to these questions suggest a self-join. My working example described below uses that suggestion, but it performs very poorly on larger data sets. I've run out of ideas for improving it, and I'd really appreciate your input.
The issue in detail
Let's assume I were developing a database that keeps track of ball possession during a football/soccer match (please understand that I can't disclose the purpose of my real application). I require an efficient, scalable way that allows me to query changes of ball possession from one player to another (i.e. passes). For example, I might be interested in a list of all passes from any defender to any forward.
Mock database structure
My mock database consists of two tables. The first table, Players, stores each player's name in the Name column and their position (GOA, DEF, MID, FOR for goalkeeper, defender, midfielder, forward) in the POS column.
The second table, Possession, keeps track of ball possession. Whenever ball possession changes, i.e. the ball is passed to a new player, a row is added to this table. The primary key ID also indicates the temporal order of possession changes: consecutive IDs indicate an immediate sequence of ball possessions.
CREATE TABLE Players(
ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
POS VARCHAR(3) NOT NULL,
Name VARCHAR(7) NOT NULL);
CREATE TABLE Possession(
ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
PlayerID INT NOT NULL);
Next, we create some indices:
CREATE INDEX POS ON Players(POS);
CREATE INDEX Name ON Players(Name);
CREATE INDEX PlayerID ON Possession(PlayerID);
Now, we populate the Players
table with a few players, and also add test entries to the Possession
table:
INSERT INTO Players (POS, Name) VALUES
('DEF', 'James'), ('DEF', 'John'), ('DEF', 'Michael'),
('MID', 'Paul'), ('FOR', 'Bob'), ('GOAL', 'Kenneth');
INSERT INTO Possession (PlayerID) VALUES
(1), (8), (2), (5), (3), (8), (3), (9), (6), (4), (7), (9);
Let's quickly check our database by joining the Possession and the Players tables:
SELECT Possession.ID, PlayerID, POS, Name
FROM
Possession
INNER JOIN Players ON Possession.PlayerID = Players.ID
ORDER BY Possession.ID;
This looks good:
+----+----------+-----+---------+
| ID | PlayerID | POS | Name |
+----+----------+-----+---------+
| 1 | 1 | DEF | James |
| 2 | 8 | FOR | Bob |
| 3 | 2 | DEF | John |
| 4 | 5 | MID | Charles |
| 5 | 3 | DEF | Michael |
| 6 | 8 | FOR | Bob |
| 7 | 3 | DEF | Michael |
| 8 | 9 | GOA | Kenneth |
| 9 | 6 | MID | Thomas |
| 10 | 4 | DEF | David |
| 11 | 7 | MID | Paul |
| 12 | 9 | GOA | Kenneth |
+----+----------+-----+---------+
The table can be read like this: First, the DEFender James passed to the FORward Bob. Then, Bob passed to the DEFender John, who in turn passed to the MIDfield Charles. After some more passes, the ball ends with the GOAlkeeper Kenneth.
Working solution
I need a query that lists all passes from any defender to any forward. As we can see in the previous table, there are two instances of that: right at the start, James sends the ball to Bob (row ID 1 to ID 2), and later on, Michael sends the ball to Bob (row ID 5 to ID 6).
To do this in SQL, I self-join the Possession table, with the second instance offset by one row. To access the players' names and positions, I also join both Possession instances to the Players table. The following query does that:
SELECT
M1.ID AS "From",
M2.ID AS "To",
P1.Name AS "Sender",
P2.Name AS "Receiver"
FROM
Possession AS M1
INNER JOIN Possession as M2 ON M2.ID = M1.ID + 1
INNER JOIN Players AS P1 ON M1.PlayerID = P1.ID AND P1.POS = 'DEF' -- see execution plan
INNER JOIN Players AS P2 ON M2.PlayerID = P2.ID AND P2.POS = 'FOR';
We get the expected output:
+------+----+---------+----------+
| From | To | Sender | Receiver |
+------+----+---------+----------+
| 1 | 2 | James | Bob |
| 5 | 6 | Michael | Bob |
+------+----+---------+----------+
The problem
While this query executes virtually instantly on the mock football database, its execution plan points to a problem. Here is the output of EXPLAIN for it:
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
| 1 | SIMPLE | P2 | ref | PRIMARY,POS | POS | 5 | const | 1 | Using index condition |
| 1 | SIMPLE | M2 | ref | PRIMARY,PlayerID | PlayerID | 4 | MOCK.P2.ID | 1 | Using index |
| 1 | SIMPLE | P1 | ALL | PRIMARY,POS | NULL | NULL | NULL | 9 | Using where; Using join buffer (flat, BNL join) |
| 1 | SIMPLE | M1 | ref | PlayerID | PlayerID | 4 | MOCK.P1.ID | 1 | Using where; Using index |
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
I have to admit that I'm not very good at interpreting query execution plans, but the third row seems to indicate a bottleneck at the join marked in the query above: apparently a full table scan is performed for the alias P1, no key is used even though POS and the primary key are available, and the Using join buffer (flat, BNL join) note also looks suspicious. I don't know exactly what any of that means, but I don't usually see it with normal joins.
Perhaps due to this bottleneck, the query does not finish within any acceptable time span on my real database. My real equivalent of the mock Players table has ~60,000 rows, and the Possession equivalent has ~1,160,000 rows. I monitored the execution of the query via SHOW PROCESSLIST. After more than 600 seconds, the process was still in the Sending data state, at which point I killed it.
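For completeness, the monitoring described above went roughly like this (the connection ID shown is a placeholder, not the actual one):

```sql
SHOW PROCESSLIST;  -- the State column for the query stayed at 'Sending data'
KILL 42;           -- 42 stands in for the Id reported by SHOW PROCESSLIST
```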
The query plan on this larger data set is quite similar to the one for the small mock data set. The third join again appears to be the problem: no key used, a full table scan performed, and the join buffer note that I don't really understand:
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
| 1 | SIMPLE | P2 | ref | POS | POS | 1 | const | 1748 | Using index condition |
| 1 | SIMPLE | M2 | ref | PlayerId | PlayerId | 2 | REAL.P2.PlayerId | 7 | |
| 1 | SIMPLE | P1 | ALL | POS | NULL | NULL | NULL | 61917 | Using where; Using join buffer (flat, BNL join) |
| 1 | SIMPLE | M1 | ref | PlayerId | PlayerId | 2 | REAL.P1.PlayerId | 7 | Using where |
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
I tried forcing an index for the alias P1 by writing Players AS P1 FORCE INDEX (POS) instead of Players AS P1 in the query shown above. This does change the execution plan: with POS forced as the key, the third line of the EXPLAIN output becomes very similar to the first line, the only difference being the row count, which is still very high (30,912). Even this modified query had not completed after 600 seconds.
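Concretely, the variant with the forced index differs from the original query only in the P1 join (a sketch of what I ran):

```sql
SELECT
    M1.ID AS "From",
    M2.ID AS "To",
    P1.Name AS "Sender",
    P2.Name AS "Receiver"
FROM
    Possession AS M1
    INNER JOIN Possession AS M2 ON M2.ID = M1.ID + 1
    INNER JOIN Players AS P1 FORCE INDEX (POS)   -- force the POS index for this join
        ON M1.PlayerID = P1.ID AND P1.POS = 'DEF'
    INNER JOIN Players AS P2 ON M2.PlayerID = P2.ID AND P2.POS = 'FOR';
```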
I don't think that this is a configuration issue. I have made up to 18 GB of RAM available to the MySQL server, and the server uses this memory for other queries. For the present query, memory consumption does not exceed 2 GB of RAM.
Back to the question
Thanks for staying with this somewhat long-winded explanation up to this point!
Let's return to the initial question: What is an efficient, scalable way of selecting two (or more) rows from a table with consecutive IDs, especially if this table is joined with another table?
My current query is certainly doing something wrong, as it didn't finish even after ten minutes. Is there anything I can change in my current query to make it usable on my larger real data set? If not: is there an alternative, better solution that I could use?