
The question in short

What is an efficient, scalable way of selecting two (or more) rows from a table with consecutive IDs, especially if this table is joined with another table?

Related questions have been asked before on Stack Overflow, e.g.:

The answers to these questions suggest a self-join. My working example described below follows that suggestion, but it performs very poorly on larger data sets. I've run out of ideas for improving it, and I'd really appreciate your input.

The issue in detail

Let's assume I were developing a database that keeps track of ball possession during a football/soccer match (please understand that I can't disclose the purpose of my real application). I require an efficient, scalable way that allows me to query changes of ball possession from one player to another (i.e. passes). For example, I might be interested in a list of all passes from any defender to any forward.

Mock database structure

My mock database consists of two tables. The first table, Players, stores the players' names in the Name column and their positions (GOA, DEF, MID, FOR for goalkeeper, defender, midfielder, forward) in the POS column.

The second table Possession keeps track of ball possession. Whenever ball possession changes, i.e. the ball is passed to a new player, a row is added to this table. The primary key ID also indicates the temporal order of possession changes: consecutive IDs indicate an immediate sequence of ball possessions.

CREATE TABLE Players(
    ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    POS VARCHAR(3) NOT NULL,
    Name VARCHAR(7) NOT NULL);

CREATE TABLE Possession(
    ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    PlayerID INT NOT NULL);

Next, we create some indices:

CREATE INDEX POS ON Players(POS);
CREATE INDEX Name ON Players(Name);

CREATE INDEX PlayerID ON Possession(PlayerID);

Now, we populate the Players table with a few players, and also add test entries to the Possession table:

INSERT INTO Players (POS, Name) VALUES 
    ('DEF', 'James'), ('DEF', 'John'), ('DEF', 'Michael'), 
    ('DEF', 'David'), ('MID', 'Charles'), ('MID', 'Thomas'), 
    ('MID', 'Paul'), ('FOR', 'Bob'), ('GOA', 'Kenneth');

INSERT INTO Possession (PlayerID) VALUES 
    (1), (8), (2), (5), (3), (8), (3), (9), (6), (4), (7), (9);

Let's quickly check our database by joining the Possession and the Players table:

SELECT Possession.ID, PlayerID, POS, Name
FROM 
    Possession
    INNER JOIN Players ON Possession.PlayerID = Players.ID 
ORDER BY Possession.ID;

This looks good:

+----+----------+-----+---------+
| ID | PlayerID | POS | Name    |
+----+----------+-----+---------+
|  1 |        1 | DEF | James   |
|  2 |        8 | FOR | Bob     |
|  3 |        2 | DEF | John    |
|  4 |        5 | MID | Charles |
|  5 |        3 | DEF | Michael |
|  6 |        8 | FOR | Bob     |
|  7 |        3 | DEF | Michael |
|  8 |        9 | GOA | Kenneth |
|  9 |        6 | MID | Thomas  |
| 10 |        4 | DEF | David   |
| 11 |        7 | MID | Paul    |
| 12 |        9 | GOA | Kenneth |
+----+----------+-----+---------+

The table can be read like this: First, the DEFender James passed to the FORward Bob. Then, Bob passed to the DEFender John, who in turn passed to the MIDfielder Charles. After some more passes, the ball ended up with the GOAlkeeper Kenneth.

Working solution

I need a query that lists all passes from any defender to any forward. As we can see in the previous table, there are two instances of that: right at the start, James sends the ball to Bob (row ID 1 to ID 2), and later on, Michael sends the ball to Bob (row ID 5 to ID 6).

To express this in SQL, I self-join the Possession table, with the second instance offset by one row. To access the players' names and positions, I also join both Possession instances to the Players table. The following query does that:

SELECT 
    M1.ID AS "From",
    M2.ID AS "To",
    P1.Name AS "Sender",
    P2.Name AS "Receiver"
FROM
    Possession AS M1
    INNER JOIN Possession AS M2 ON M2.ID = M1.ID + 1
    INNER JOIN Players AS P1 ON M1.PlayerID = P1.ID AND P1.POS = 'DEF' -- see execution plan
    INNER JOIN Players AS P2 ON M2.PlayerID = P2.ID AND P2.POS = 'FOR';

We get the expected output:

+------+----+---------+----------+
| From | To | Sender  | Receiver |
+------+----+---------+----------+
|    1 |  2 | James   | Bob      |
|    5 |  6 | Michael | Bob      |
+------+----+---------+----------+

The problem

While this query executes virtually instantly on the mock football database, its execution plan appears problematic. Here is the output of EXPLAIN for it:

+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
| id   | select_type | table | type | possible_keys    | key      | key_len | ref        | rows | Extra                                           |
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+
|    1 | SIMPLE      | P2    | ref  | PRIMARY,POS      | POS      | 5       | const      |    1 | Using index condition                           |
|    1 | SIMPLE      | M2    | ref  | PRIMARY,PlayerID | PlayerID | 4       | MOCK.P2.ID |    1 | Using index                                     |
|    1 | SIMPLE      | P1    | ALL  | PRIMARY,POS      | NULL     | NULL    | NULL       |    9 | Using where; Using join buffer (flat, BNL join) |
|    1 | SIMPLE      | M1    | ref  | PlayerID         | PlayerID | 4       | MOCK.P1.ID |    1 | Using where; Using index                        |
+------+-------------+-------+------+------------------+----------+---------+------------+------+-------------------------------------------------+

I have to admit that I'm not very good at interpreting query execution plans. But it seems to me that the third row indicates a bottleneck for the join marked in the query above: apparently, a full table scan is performed on the P1 alias, no key is used even though POS and the primary key are available, and the Using join buffer (flat, BNL join) part is also suspicious. I don't know what any of that means, but I usually don't see it with normal joins.

Perhaps due to this bottleneck, the query does not finish within any acceptable time span on my real database. My real equivalent of the mock Players table has ~60,000 rows, and the Possession equivalent has ~1,160,000 rows. I monitored the execution of the query via SHOW PROCESSLIST. After more than 600 seconds, the process was still tagged as Sending data, at which point I killed it.

The query plan on this larger data set is rather similar to the one for the small mock data set: the third join again appears problematic, with no key used, a full table scan performed, and the join buffer part that I still don't really understand:

+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
| id   | select_type | table | type | possible_keys | key      | key_len | ref              | rows  | Extra                                           |
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+
|    1 | SIMPLE      | P2    | ref  | POS           | POS      | 1       | const            |  1748 | Using index condition                           |
|    1 | SIMPLE      | M2    | ref  | PlayerId      | PlayerId | 2       | REAL.P2.PlayerId |     7 |                                                 |
|    1 | SIMPLE      | P1    | ALL  | POS           | NULL     | NULL    | NULL             | 61917 | Using where; Using join buffer (flat, BNL join) |
|    1 | SIMPLE      | M1    | ref  | PlayerId      | PlayerId | 2       | REAL.P1.PlayerId |     7 | Using where                                     |
+------+-------------+-------+------+---------------+----------+---------+------------------+-------+-------------------------------------------------+

I tried forcing an index for the aliased table P1 by using Players AS P1 FORCE INDEX (POS) instead of Players AS P1 in the query shown above (see the sketch below). This change does affect the execution plan: with POS forced as the key, the third line of the EXPLAIN output becomes very similar to the first line, the only difference being the number of rows, which is still very high (30,912). Even this modified query did not complete within 600 seconds.
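For reference, the full query with the forced index, reconstructed from the description above, looked like this (only the P1 join line changes):

SELECT 
    M1.ID AS "From",
    M2.ID AS "To",
    P1.Name AS "Sender",
    P2.Name AS "Receiver"
FROM
    Possession AS M1
    INNER JOIN Possession AS M2 ON M2.ID = M1.ID + 1
    INNER JOIN Players AS P1 FORCE INDEX (POS) ON M1.PlayerID = P1.ID AND P1.POS = 'DEF'
    INNER JOIN Players AS P2 ON M2.PlayerID = P2.ID AND P2.POS = 'FOR';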

I don't think that this is a configuration issue. I have made up to 18 GB of RAM available to the MySQL server, and the server uses this memory for other queries. For the present query, memory consumption does not exceed 2 GB of RAM.

Back to the question

Thanks for staying with this somewhat long-winded explanation up to this point!

Let's return to the initial question: What is an efficient, scalable way of selecting two (or more) rows from a table with consecutive IDs, especially if this table is joined with another table?

My current query is certainly doing something wrong, as it didn't finish even after ten minutes. Is there something that I can change in my current query to make it useful for my larger real data set? If not: is there an alternative, better solution that I could use?

Schmuddi
    Wow, it's soooooooooooo long. – Blank Jul 20 '16 at 08:55
  • It translates to "self joins are slow. need alternative" – Ash Jul 20 '16 at 08:56
  • Yes, apologies for the long explanation. :( Yet, I had the feeling that I need to provide a working example that illustrates my problem, as self-joins have been suggested before as solutions to related problems. – Schmuddi Jul 20 '16 at 08:58
  • Copy-pasting your code into [SQL Fiddle](http://sqlfiddle.com/#!9/e52df/2) gives me a different execution plan with *type* = *ref* for the third line? – hsan Jul 20 '16 at 09:31
  • @hsan different MySQL instances using different data (even sometimes using the same data) may use different execution paths; it is not surprising at all. – Shadow Jul 20 '16 at 09:39
  • @Schmuddi - do you have some more data (an SQL file) for the tables, with the expected result, to optimize the query? With so few rows the optimizer can work differently. – Bernd Buffen Jul 20 '16 at 14:02

2 Answers


I believe the issue is that you only have single-field indexes on the Players table. MySQL can use only a single index per joined table.

In the case of the Players table, two fields are key from a performance point of view:

  • playerid, since it is used in the join;
  • pos, since you filter on it.

You seem to have standalone indexes on both fields, but this forces MySQL to choose between using an index for joining the two tables and using one to filter on the where criteria.

I would create a multi-column index on the playerid and pos fields (in this order). This way MySQL can use a single index to satisfy both the join and the where; see the sketch below.
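For the schema in the question, that would be something like the following (the index name is arbitrary; note that on the Players table the joined column is the primary key ID):

CREATE INDEX ID_POS ON Players(ID, POS);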

I would also use explicit joins instead of a comma-separated list of tables with the join conditions in the where clause, for better readability.

Shadow
  • Thanks for suggesting a multi-column index. Your explanation why this could help here makes perfect sense, but unfortunately, the query doesn't really benefit from it. I created the index by `CREATE INDEX ID_POS ON Players(ID, POS)`, and the index is shown by `EXPLAIN` as a possible key both for table `P1` and `P2`, but it's only used for `P2`. For `P1`, the execution plan remains unchanged. I can `FORCE` it also for `P1`, but the execution time of the query on my real database is still abysmal. Any further ideas? – Schmuddi Jul 22 '16 at 08:06
  • What if you rewrite your query to use explicit joins (m1 inner join m2 on m1.id+1=m2.id...) instead of implicit (comma separated list of tables, joining conditions in where)? You could also check if your MySQL version supports indexed generated fields and could create a generated column with an index that adds 1 to the id field and use that for the join. – Shadow Jul 22 '16 at 08:26
  • I changed the queries to explicit joins also in the question; they are more readable indeed. However, this change doesn't affect the execution plan. I'll still have to try out your new idea to use generated columns, which sounds promising. – Schmuddi Jul 22 '16 at 10:30
  • I've tried using an indexed generated column – this appears to be the solution to my problem! The query that didn't even finish after more than 10 minutes now completes after less than 4 seconds. There are still a few things that I need to consider, but your suggestion will probably answer my question – thanks a lot! However, before I accept it, you might want to edit your answer so that the actually working solution is found in the answer text, and not only in the comment. – Schmuddi Jul 25 '16 at 09:17
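For reference, a minimal sketch of the indexed generated column described in the comments above (the column and index names are made up, and some MySQL/MariaDB versions do not allow generated columns that reference an AUTO_INCREMENT column, so check your version's documentation):

-- MariaDB syntax; MySQL uses STORED instead of PERSISTENT.
ALTER TABLE Possession
    ADD COLUMN NextID INT AS (ID + 1) PERSISTENT,
    ADD INDEX NextID (NextID);

The self-join can then use the indexed column instead of the ID + 1 expression:

    INNER JOIN Possession AS M2 ON M2.ID = M1.NextID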

Here's a general plan:

SELECT
        @n := @n + 1  AS N,  -- Now the rows will be numbered 1,2,3,...
        ...
    FROM ( SELECT @n := 0 ) AS init
    JOIN tbl
    ORDER BY ...  -- based on your definition of 'consecutive'

Then you can use that query as a subquery somewhere else.

SELECT ...
    FROM ( the above query )  AS x
    GROUP BY ceiling(N/2)  -- 1&2 will be grouped together; 3&4; etc

You can use `IF((N % 2) = 1, ..., ...)` to do different things with the first versus the second item in each pair.

You mentioned JOINing to another table. If possible, avoid doing the JOIN until this last SELECT.
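Applied to the mock Possession table from the question, a sketch of this plan might look like the following (untested; it pairs rows 1&2, 3&4, etc., and user variables of this kind are deprecated in favor of window functions in MySQL 8):

SELECT
        MIN(IF(N % 2 = 1, PlayerID, NULL)) AS SenderID,   -- first row of each pair
        MIN(IF(N % 2 = 0, PlayerID, NULL)) AS ReceiverID  -- second row of each pair
    FROM (
        SELECT
            @n := @n + 1  AS N,  -- rows numbered 1,2,3,... in possession order
            PlayerID
        FROM ( SELECT @n := 0 ) AS init
        JOIN Possession
        ORDER BY ID
    ) AS x
    GROUP BY CEILING(N / 2);

The Players table would then be joined against SenderID and ReceiverID in this outer query, per the advice above.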

Rick James
  • Thanks for the suggestion. I think I don't need the first subquery, because the `ID` field in `Possession` doubles as the primary key and a numbering indicating "consecutiveness". Using `CEILING` to group consecutive rows is an interesting idea. But I think it won't work in my case. I might be interested not only in groups 1&2, 3&4, etc., but also in group 2&3 and 4&5. This wouldn't be possible using your idea, right? – Schmuddi Jul 22 '16 at 08:12
  • You have N rows? And you need N-1 "groups"? Plan A: `UNION` of 1&2, etc with 2&3 etc. Plan B: walk through the table some other way. – Rick James Jul 22 '16 at 16:01