Group only certain rows with GROUP BY

Question

SCHEMA

I have the following set-up in MySQL database:

CREATE TABLE items (
  id SERIAL,
  name VARCHAR(100),
  group_id INT,
  price DECIMAL(10,2),
  KEY items_group_id_idx (group_id),
  PRIMARY KEY (id)
);

INSERT INTO items VALUES 
(1, 'Item A', NULL, 10),
(2, 'Item B', NULL, 20),
(3, 'Item C', NULL, 30),
(4, 'Item D', 1,    40),
(5, 'Item E', 2,    50),
(6, 'Item F', 2,    60),
(7, 'Item G', 2,    70);

PROBLEM

I need to select:

All items with group_id that has NULL value, and

One item from each group identified by group_id having the lowest price.

EXPECTED RESULTS

+----+--------+----------+-------+
| id | name   | group_id | price |
+----+--------+----------+-------+
|  1 | Item A |     NULL | 10.00 | 
|  2 | Item B |     NULL | 20.00 | 
|  3 | Item C |     NULL | 30.00 | 
|  4 | Item D |        1 | 40.00 | 
|  5 | Item E |        2 | 50.00 | 
+----+--------+----------+-------+

POSSIBLE SOLUTION 1: Two queries with UNION ALL

SELECT id, name, group_id, price FROM items
WHERE group_id IS NULL
UNION ALL
SELECT id, name, MIN(price) FROM items
WHERE group_id IS NOT NULL
GROUP BY group_id;

/* EXPLAIN */
+----+--------------+------------+------+--------------------+--------------------+---------+-------+------+----------------------------------------------+
| id | select_type  | table      | type | possible_keys      | key                | key_len | ref   | rows | Extra                                        |
+----+--------------+------------+------+--------------------+--------------------+---------+-------+------+----------------------------------------------+
|  1 | PRIMARY      | items      | ref  | items_group_id_idx | items_group_id_idx | 5       | const |    3 | Using where                                  | 
|  2 | UNION        | items      | ALL  | items_group_id_idx | NULL               | NULL    | NULL  |    7 | Using where; Using temporary; Using filesort | 
| NULL | UNION RESULT | <union1,2> | ALL  | NULL               | NULL               | NULL    | NULL  | NULL |                                              | 
+----+--------------+------------+------+--------------------+--------------------+---------+-------+------+----------------------------------------------+

However it is undesirable to have two queries since there will be more complex condition in WHERE clause and I would need to sort the final results.

POSSIBLE SOLUTION 2: GROUP BY on expression (reference)

SELECT id, name, group_id, MIN(price) FROM items
GROUP BY CASE WHEN group_id IS NOT NULL THEN group_id ELSE RAND() END;

/* EXPLAIN */
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | items | ALL  | NULL          | NULL | NULL    | NULL |    7 | Using temporary; Using filesort | 
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+

Solution 2 seems to be faster and simple to use but I'm wondering whether there is a better approach in terms of performance.

UPDATE:

According to documentation referenced by @axiac, this query is illegal in SQL92 and earlier and may work in MySQL only.

*"One item from each group ..."* -- this is **not** a job for `GROUP BY`. — axiac, Mar 15 '16 at 12:18
`GROUP BY` **generates** one row in the result set from each group. It does not select a row from the group. You can use `MIN(price)` to get the lowest price but there is no link between it and the row that contains the lowest price. There can be even two or more rows having that price. Using `GROUP BY` there is no way to get the entire row that has the lowest price. Read more about [how MySQL handles `GROUP BY`](http://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html). Check [this answer](http://stackoverflow.com/a/28090544/4265352) for a solution to your problem. — axiac, Mar 15 '16 at 12:34
Possible duplicate of [Get records with max value for each group of grouped SQL results](http://stackoverflow.com/questions/12102200/get-records-with-max-value-for-each-group-of-grouped-sql-results) — axiac, Mar 15 '16 at 12:35
Same method as your 2nd query but wiht lower chance of collisions `SELECT id, SUBSTRING_INDEX(GROUP_CONCAT(name ORDER BY price),',',1), MIN(price) FROM items GROUP BY IFNULL(group_id,UUID()) ORDER BY id` — Mihai, Mar 15 '16 at 12:52
I used the GROUP_CONCAT trick to make sure you always get the name with the lower price,in your query it is not guaranteed. — Mihai, Mar 15 '16 at 12:58
@axiac, thank you for excellent description and the reference to the documentation and your answer to somewhat similar question. Although I disagree that my question is duplicate because I'm not just looking to aggregate results, I also need do it conditionally. — Gyrocode.com, Mar 15 '16 at 14:12
The `GROUP_CONCAT()` trick proposed by @Mihai should also work but it produces bloated queries, difficult to read and understand especially when there are many fields that needs to be retrieved from the table. It also has the disadvantage that for each field you have to carefully choose a delimiter to be able to split the string returned by `GROUP_CONCAT()`. For some fields (that contain user-input data) it could be impossible. — axiac, Mar 15 '16 at 14:49

score 1 · Accepted Answer · edited May 23 '17 at 12:24

According to this answer by @axiac, better solution in terms of compatibility and performance is shown below.

It is also explained in SQL Antipatterns book, Chapter 15: Ambiguous Groups.

To improve performance, combined index is also added for (group_id, price, id).

SOLUTION

SELECT a.id, a.name, a.group_id, a.price
FROM items a
LEFT JOIN items b 
ON a.group_id = b.group_id 
AND (a.price > b.price OR (a.price = b.price and a.id > b.id))
WHERE b.price is NULL;

See explanation on how it works for more details.

By accident as a side-effect this query works in my case where I needed to include ALL records with group_id equals to NULL AND one item from each group with the lowest price.

RESULT

+----+--------+----------+-------+
| id | name   | group_id | price |
+----+--------+----------+-------+
|  1 | Item A |     NULL | 10.00 | 
|  2 | Item B |     NULL | 20.00 | 
|  3 | Item C |     NULL | 30.00 | 
|  4 | Item D |        1 | 40.00 | 
|  5 | Item E |        2 | 50.00 | 
+----+--------+----------+-------+

EXPLAIN

+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys                 | key                | key_len | ref                        | rows | Extra                    |
+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+
|  1 | SIMPLE      | a     | ALL  | NULL                          | NULL               | NULL    | NULL                       |    7 |                          | 
|  1 | SIMPLE      | b     | ref  | PRIMARY,id,items_group_id_idx | items_group_id_idx | 5       | agi_development.a.group_id |    1 | Using where; Using index | 
+----+-------------+-------+------+-------------------------------+--------------------+---------+----------------------------+------+--------------------------+

And now add those "more complex conditions in WHERE" you were talking about :-) — dnoeth, Mar 15 '16 at 15:07
@dnoeth, that should be irrelevant as those conditions would need to be applied to every solution proposed here. I wanted to avoid `UNION ALL` to avoid applying the same complex condition twice. — Gyrocode.com, Mar 15 '16 at 15:51
Not in my solution :) You still must apply the same conditions on both `items a` and `items b`, e.g. try to filter for a `minimum price > 50` — dnoeth, Mar 15 '16 at 16:19
@dnoeth, your solution is smart, I like `-id` part. But I could use `WHERE b.price is NULL AND a.price > 30` once and it works fine. — Gyrocode.com, Mar 15 '16 at 18:57
@dnoeth, that gives me 0 rows since I have no items with prices above 50, which is correct. — Gyrocode.com, Mar 15 '16 at 19:02
This is just an example ("find the minimum price over 50") because there are no other columns, but if you have conditions on other columns you must repeat them. — dnoeth, Mar 15 '16 at 20:20
@dnoeth, there is probably a misunderstanding, but I disagree. I can apply further conditions to table `a` only and that works fine, there is no need to repeat that clause for table `b` which is only there to find items with lowest price. — Gyrocode.com, Mar 15 '16 at 20:35

Gordon Linoff · Answer 2 · 2016-03-15T14:22:14.357

0

You can do this using where conditions:

SQLFiddle Demo

select t.*
from t
where t.group_id is null or
      t.price = (select min(t2.price)
                 from t t2
                 where t2.group_id = t.group_id
                );

Note that this returns all rows with the minimum price, if there is more than one for a given group.

EDIT:

I believe the following fixes the problem of multiple rows:

select t.*
from t
where t.group_id is null or
      t.id = (select t2.id
              from t t2
              where t2.group_id = t.group_id
              order by t2.price asc
              limit 1
             );

Unfortunately, SQL Fiddle is not working for me right now, so I cannot test it.

edited Mar 15 '16 at 14:22

answered Mar 15 '16 at 12:15

Gordon Linoff

1,242,037
58
646
786

Thanks for your suggestion but your query returns two rows when `group_id` and `price` are the same, see http://sqlfiddle.com/#!9/2c179/1 – Gyrocode.com Mar 15 '16 at 12:25
@Gyrocode.com . . . That is exactly what the "Note" says. – Gordon Linoff Mar 15 '16 at 14:11
Thank you very much for your answer, your second query works fine now with duplicate prices within the same group. However when outputs from `EXPLAIN` are compared, seems that the query in [this answer](http://stackoverflow.com/a/36013983/3549014) is better in terms of performance. – Gyrocode.com Mar 15 '16 at 14:43
@GordonLinoff, I don't have the real data yet but strangely the timings on your query is slightly better and `SHOW STATUS LIKE 'last_query_cost'` returns `0` where my query returns `21`. Not sure why. – Gyrocode.com Mar 15 '16 at 19:14

score 0 · Answer 3 · answered Mar 15 '16 at 14:18

If group_id is always a positive value you can simplify it without GUID/RAND:

SELECT id, name, min(price) FROM items
GROUP BY COALESCE(group_id, -id); -- id is already unique

But both queries will not return a correct result if you change the order of Inserts, I'll add a Fiddle when it's working again...

Gordon's query should work as expected or you use an old trick to get another column for MIN: piggybacking.

You concat multiple columns as fixed length string, MIN column as #1 and apply the MIN on this string. In the next step you extract the columns again using matching SUBSTRING:

SELECT
   CASE WHEN grp > 0 THEN grp ELSE NULL END AS group_id
   ,CAST(SUBSTRING(x FROM 1 FOR 13) AS DECIMAL(10,2)) AS price
   ,SUBSTRING(x FROM 24) AS NAME
FROM
 (
   SELECT COALESCE(group_id, -id) AS grp
      -- results in a string like this
      -- '        50.00         5Item E'
      ,MIN(LPAD(CAST(price AS VARCHAR(13)),13) 
           || LPAD(CAST(id AS VARCHAR(10)),10)
           || NAME) AS x
   FROM items
   GROUP BY grp
 ) AS dt;

See [MySQL Handling of `GROUP BY`](http://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html) and [this comment](http://stackoverflow.com/questions/36010981/group-only-certain-rows-with-group-by/36013983#comment59674237_36010981). Although your first query works and somewhat simpler than my attempt #2, it's not 100% correct and could only work in MySQL. Your second query is too complex for such task. — Gyrocode.com, Mar 15 '16 at 14:28
@Gyrocode.com: I wrote "But both queries will not return a correct result" :) Regarding "complexity", it's just a single table access vs. a self-join based on non-equality. Btw, in almost every other DBMS this would be a simple task for `ROW_NUMBER`. — dnoeth, Mar 15 '16 at 14:39

Group only certain rows with GROUP BY

3 Answers3

Linked