Group By ignores sorting in subquery

Question

There is a TLDR version at the bottom.

Note: I have based my current solution on the proposed solution in this question here (proposed in the question text itself), however it does not work for me even if it works for that person. So I'm not sure how to handle this, because the question seems like a duplicate but the answer given there doesn't work for me. So I guess something must be different for me. If someone can tell me how to correctly handle this, I'm open to hearing.

I have a table like this one here:

scope_id    key_id    value
0           0         0_0
0           1         0_1
1           0         1_0
2           0         2_0
2           1         2_1

The scopes have a hierarchy where scope 0 is the parent of scope 2 and scope 2 is the parent of scope 1. (on purpose not sorted, they IDs are UUIDs, just for reading numbers here)

My use case is that I want the value of multiple keys in a specific scope (scope 1). However if there is no value defined for scope 1, I would be fine with a value from its parent (scope 2) and lastly if there is also no value in scope 2 I would take a value from its parent, scope 0. So if possible, I want the value from scope 1, if it doesn't have a value then from scope 2 and lastly I try to get the value from scope 0. (The scopes are a tree structure, so each scope can have max one parent, however a parent can have multiple childs).

So in the example above, if I want the value of key 0 in scope 1, I'd like to get 1_0 as the key is defined in the scope. If I want the value of key 1 in scope 1, I'd like to get 2_1 as there is no value defined in the scope 1 but in its parent scope 2 there is. And lastly if I want the value of keys 0 and 1 in scope 1, I want to get 1_0 and 2_1.

Currently it is solved by making 3 separate SQL requests and merging it in code. That works fine and fast enough, but I want to see if it would be faster with a single SQL query. I came up with the following query (based on the update in the question text here):

SELECT * 
FROM (
    SELECT * 
    FROM test
    WHERE key_id IN (0, 1)
        AND scope_id IN (1 , 2, 0)
    ORDER BY FIELD(scope_id, 1 , 2, 0)
) t1
GROUP BY t1.key_id;

The inner subquery first finds all keys that I want to look at and makes sure they are in the scope that I want to look at or it's parent scope. Then I order the scopes, so that first the child is, then the parent, then the grandparent. Now I expect group by to leave the value of the first row it finds, so hopefully the child (scope 1). However this doesn't work. Instead the first value based on the actual table is used.

TLDR

When grouping with GROUP BY in the query above, why is the order defined by the ORDER BY query ignored? Instead the first value based on the original table is taken when grouping.

Using this code you can try for yourself:

# this group by doesn't work with strict mode
SET sql_mode = '';
CREATE TABLE IF NOT EXISTS test(
    scope_id int,
    key_id int,
    `value` varchar(20),
    PRIMARY KEY (scope_id, key_id)
);
INSERT IGNORE INTO test values 
    (0, 0, "0_0"),
    (1, 0, "1_0"),
    (2, 0, "2_0"),
    (2, 1, "2_1"),
    (0, 1, "0_1");
    
SELECT * 
FROM (
    SELECT * 
    FROM test
    WHERE key_id IN (0, 1)
        AND scope_id IN (1 , 2, 0)
    ORDER BY FIELD(scope_id, 1 , 2, 0)
) t1
GROUP BY t1.key_id;
# expected result are the rows that contain value 1_0 and 2_1

@GMB I already have it in the text version, I also added it to the example at the bottom. — findusl, Feb 28 '20 at 18:28

GMB · Answer 1 · 2020-02-28T18:36:37.250

1

I understand your question as a greatest-n-per-group variant.

In this situation, you should not think aggregation, but filtering.

You could solve it with a correlated subquery that selects the first available scope_id per key_id:

select t.*
from test t
where t.scope_id = (
    select t1.scope_id 
    from test t1 
    where t1.key_id = t.key_id
    order by field(scope_id, 1, 2, 0)
    limit 1
)

For performance, you want an index on (key_id, scope_id).

Demo on DB Fiddle:

scope_id | key_id | value
-------: | -----: | :----
       1 |      0 | 1_0  
       2 |      1 | 2_1

edited Feb 28 '20 at 18:36

answered Feb 28 '20 at 18:34

GMB

216,147
25
84
135

Would that be efficient as it runs the subquery for every value? I don't have a big enough table to benchmark it easily. – findusl Feb 28 '20 at 18:36
@findusl: with the right index in place (see my answer), this should be an efficient solution. – GMB Feb 28 '20 at 18:38

score 0 · Answer 2 · answered Feb 28 '20 at 22:12

This will get what you want. Use a row number to effectively "save" your order for the next section of the query.

MySQL 8.0 or newer:

SELECT * 
FROM (
    SELECT *, ROW_NUMBER() rank
    FROM test
    WHERE key_id IN (0, 1)
        AND scope_id IN (1 , 2, 0)
    ORDER BY FIELD(scope_id, 1 , 2, 0)
) t1
GROUP BY t1.key_id
order by rank;

MySQL 5.7 or older:

SET @row_num = 0;
SELECT * 
FROM (
    SELECT *, @row_num := @row_num + 1 rank
    FROM test
    WHERE key_id IN (0, 1)
        AND scope_id IN (1 , 2, 0)
    ORDER BY FIELD(scope_id, 1 , 2, 0)
) t1
GROUP BY t1.key_id
ORDER BY rank;

Soap Box: MySQL results are, in general, horribly unreliable in any query that has 1 or more columns in a group by or aggregate but does not have all columns in a group by or aggregate.

Group By ignores sorting in subquery

TLDR

2 Answers2