
Underlying data:

mysql> select * from parent;
+----+------------+-----------+
| id | first_name | last_name |
+----+------------+-----------+
|  1 | Joe        | Wilkenson |
|  2 | Bob        | Thompson  |
|  3 | Bob        | Ericson   |
+----+------------+-----------+

mysql> select * from child;
+----+-----------+------------+-----------+---------+
| id | parent_id | first_name | last_name | fingers |
+----+-----------+------------+-----------+---------+
|  1 |         1 | Carla      | Wilkenson |      10 |
|  2 |         1 | Peter      | Wilkenson |       8 |
|  3 |         2 | Sam        | Thompson  |      10 |
|  4 |         3 | Beth       | Ericson   |      10 |
|  5 |         3 | Georgie    | Ericson   |      10 |
|  6 |         3 | Kyle       | Ericson   |       7 |
+----+-----------+------------+-----------+---------+

Goal: Figure out the total number of children's fingers, grouped by the parent's first name.

SELECT
  parent.first_name,
  ( SELECT SUM(fingers) FROM child WHERE child.parent_id IN ([collapsed parent.ids]) ) AS `kid_fingers`
FROM parent
GROUP BY parent.first_name;

I know that, with this specific example, the goal can be achieved by simply moving the subquery into a JOIN with a SUM(child.fingers). My actual scenario is far more complicated, with many more table joins that would cause SUM() to return a badly inflated, inaccurate number due to row duplication.
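To make that inflation concrete, here's a minimal sketch using SQLite via Python as a stand-in for MySQL. The `pet` table is hypothetical, not from my real schema; it just plays the role of "one more join" that duplicates the parent/child rows:

```python
import sqlite3

# One parent (Joe) with two children and two pets (pet is a hypothetical
# extra table standing in for any additional join).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER, fingers INTEGER);
CREATE TABLE pet    (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
INSERT INTO parent VALUES (1, 'Joe');
INSERT INTO child  VALUES (1, 1, 10), (2, 1, 8);
INSERT INTO pet    VALUES (1, 1, 'Rex'), (2, 1, 'Fido');
""")

# Joining both child and pet yields 2 x 2 = 4 rows for Joe, so each child's
# fingers are counted twice: SUM reports 36 instead of the correct 18.
inflated = conn.execute("""
    SELECT p.first_name, SUM(c.fingers) AS kid_fingers
    FROM parent AS p
    JOIN child AS c ON c.parent_id = p.id
    JOIN pet        ON pet.parent_id = p.id
    GROUP BY p.first_name
""").fetchall()
print(inflated)  # [('Joe', 36)]
```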

I also attempted to JOIN against a subquery, but that results in an unavoidable full scan on a large table, which isn't feasible in my environment.

That leaves a SELECT subquery, but there's the obvious problem that the field I need to compare against gets collapsed by the GROUP BY. This seems like a really good opportunity for MySQL to offer an aggregate function that simply surfaces the collapsed values for comparison, but I don't think one exists?

Any other tricks that might work for this? Worst case, I can do a REGEXP compare against GROUP_CONCAT(), but that seems gross (and potentially risky) for many reasons. I'm unfortunately dealing with a very large, dynamically built query that can have many different GROUP BY combinations, so a full refactor or redesign is problematic.

P.S. Don't ask why the children are missing fingers. I just needed something to count.

Rikaelus
  • The solution to your problem with multiple joins is to join with subqueries that calculate the grouped sums, instead of doing the sum in the main query. – Barmar Aug 26 '21 at 23:42
  • See https://stackoverflow.com/questions/37978511/join-tables-with-sum-issue-in-mysql/37979049#37979049 – Barmar Aug 26 '21 at 23:42
  • What do you mean by "collapsed parent ids"? – Barmar Aug 26 '21 at 23:43
  • We're trying to avoid that. As mentioned, putting my real-world subquery in as a JOIN results in a full table scan and there can be millions of rows in that table. That subquery really needs to be limited to applicable IDs and that's only possible in a SELECT subquery. – Rikaelus Aug 27 '21 at 00:45

1 Answer


You don't need IN, just use WHERE child.parent_id = parent.id.

Since that subquery sums for each parent ID, you can wrap it in another SUM() in the outer query to combine all the parents with the same first name.

SELECT
  parent.first_name,
  SUM(( SELECT SUM(fingers) FROM child WHERE child.parent_id = parent.id )) AS `kid_fingers`
FROM parent
GROUP BY parent.first_name;


Another option is to use another nested subquery to get all the parent IDs with the same name, and sum that (I guess this is what you meant by "collapsed parent.ids").

SELECT
  parent.first_name,
  ( SELECT SUM(fingers) 
    FROM child 
    WHERE child.parent_id IN (
        SELECT p.id 
        FROM parent AS p 
        WHERE p.first_name = parent.first_name
    )
  ) AS `kid_fingers`
FROM parent
GROUP BY parent.first_name;


And of course it can be done with a join:

SELECT p.first_name, SUM(c.kid_fingers) AS kid_fingers
FROM parent AS p
JOIN (
  SELECT parent_id, SUM(fingers) AS kid_fingers
  FROM child
  GROUP BY parent_id
) AS c ON c.parent_id = p.id
GROUP BY p.first_name;

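Both the wrapped-SUM form and the derived-table join can be sanity-checked against the sample data. A quick sketch using Python's sqlite3 as a stand-in for MySQL (the syntax is close enough for these queries):

```python
import sqlite3

# Recreate the question's sample tables in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER,
                     first_name TEXT, last_name TEXT, fingers INTEGER);
INSERT INTO parent VALUES
  (1, 'Joe', 'Wilkenson'), (2, 'Bob', 'Thompson'), (3, 'Bob', 'Ericson');
INSERT INTO child VALUES
  (1, 1, 'Carla', 'Wilkenson', 10), (2, 1, 'Peter', 'Wilkenson', 8),
  (3, 2, 'Sam', 'Thompson', 10),    (4, 3, 'Beth', 'Ericson', 10),
  (5, 3, 'Georgie', 'Ericson', 10), (6, 3, 'Kyle', 'Ericson', 7);
""")

# Correlated subquery wrapped in an outer SUM().
wrapped_sum = """
  SELECT parent.first_name,
         SUM((SELECT SUM(fingers) FROM child
              WHERE child.parent_id = parent.id)) AS kid_fingers
  FROM parent
  GROUP BY parent.first_name
  ORDER BY parent.first_name
"""

# Pre-aggregated derived table joined to parent.
derived_join = """
  SELECT p.first_name, SUM(c.kid_fingers) AS kid_fingers
  FROM parent AS p
  JOIN (SELECT parent_id, SUM(fingers) AS kid_fingers
        FROM child GROUP BY parent_id) AS c ON c.parent_id = p.id
  GROUP BY p.first_name
  ORDER BY p.first_name
"""

for query in (wrapped_sum, derived_join):
    print(conn.execute(query).fetchall())  # [('Bob', 37), ('Joe', 18)] both times
```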

Barmar
  • That first option seems surprisingly plausible. I wouldn't have thought MySQL would know what to do with the `parent.id` reference at that level, let alone correctly run it for each row within the given groups. I'll give that a shot in my real-world query. Unfortunately the other two options aren't feasible given the complexity of my joins and the nature of my tables. – Rikaelus Aug 27 '21 at 00:50
  • It's just an ordinary correlated subquery. – Barmar Aug 27 '21 at 00:52
  • But one that touches every row in the table in order to join the data, without the benefit of any filtering to indexes. That gets expensive when you have millions of rows and the vast majority of the data being scanned might not even be used. – Rikaelus Aug 27 '21 at 01:03
  • No, `WHERE child.parent_id = parent.id` makes it only touch the rows that have the specified foreign key. That's how correlated subqueries work. – Barmar Aug 27 '21 at 01:05
  • Right. That works for the SELECT subquery but not the JOIN subquery. I thought that's what you were referring to. I'm surprised it works in the SELECT like that because `parent.id` isn't an actual value that gets returned with the row. I've never seen a subquery done within an aggregation function like `SUM()` before. – Rikaelus Aug 27 '21 at 03:09
  • I was a little surprised I could do that, too. You can't put an aggregate inside another aggregate, but you can when the nested one is a correlated subquery. – Barmar Aug 27 '21 at 03:27
  • Unfortunately it's not working in my real-world situation where I have other JOINs causing duplicate rows that get squashed. The subquery is being run _for each row_, even if they're duplicates, and the result is being included in the `SUM`. See http://www.sqlfiddle.com/#!9/984c18/1 as an example. The end result *should* only be 18 but the subquery is running 4 times, resulting in 72. – Rikaelus Aug 27 '21 at 17:17
  • If you want a more specific answer, ask a more specific question. – Barmar Aug 27 '21 at 17:30
  • I did state in the OP that my real-life query included many additional joins and I explicitly stated that those joins result in `SUM` being inflated, which is what is happening with your solution. All the specifics were there. – Rikaelus Aug 27 '21 at 21:04
  • I know, but without seeing what they specifically are, it's hard for me to provide a solution that solves the problems you have. If joining with subqueries has performance problems, what we usually do is create a temporary table with the result of the subquery. Then you can add indexes to the temporary table, which will make the joins faster. – Barmar Aug 27 '21 at 21:38
  • In an earlier comment I gave you a link to a question that explains how to solve the general problem of inflated results when doing aggregation with multiple joins. – Barmar Aug 27 '21 at 21:39
  • @Rikaelus your sqlfiddle subquery did not use the method described by Barmar and demonstrated in the 3rd query in the answer, which JOINs the basic table to a derived table (subquery in the FROM clause). With that method you would not get 72. – ypercubeᵀᴹ Sep 16 '21 at 13:16
  • If you have more tables to join, you can still do that (with each one in a separate derived table if aggregates are needed form those tables as well), as his other answer explains: https://stackoverflow.com/questions/37978511/join-tables-with-sum-issue-in-mysql/37979049#37979049 – ypercubeᵀᴹ Sep 16 '21 at 13:18
  • See also my answer on a similar problem: https://dba.stackexchange.com/questions/17012/help-with-this-query/17016#17016 What I refer to as "Option 3" is exactly what Barmar has in his 3rd query here and in the other linked SO answer. – ypercubeᵀᴹ Sep 16 '21 at 13:19
  • @ypercubeᵀᴹ And that might work *if* every other join can realistically be done in a derived table. Any conventional join will result in the numbers being multiplied by the number of additional rows the join causes. As stated in the OP the query includes many joins and is dynamically put together, making any significant change (ie, converting everything to derived tables) problematic. Hence my very specific question to which the answer seems to just be "no." – Rikaelus Sep 16 '21 at 18:36
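Barmar's temporary-table suggestion from the comments can be sketched as follows, again using SQLite via Python as a stand-in (in MySQL this would be CREATE TEMPORARY TABLE plus an index on the join key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER, fingers INTEGER);
INSERT INTO parent VALUES (1, 'Joe'), (2, 'Bob'), (3, 'Bob');
INSERT INTO child VALUES
  (1, 1, 10), (2, 1, 8), (3, 2, 10), (4, 3, 10), (5, 3, 10), (6, 3, 7);

-- Materialize the per-parent sums once, then index the join key so the
-- main query's join becomes an indexed lookup instead of a repeated scan.
CREATE TEMP TABLE child_sums AS
  SELECT parent_id, SUM(fingers) AS kid_fingers
  FROM child
  GROUP BY parent_id;
CREATE INDEX idx_child_sums_pid ON child_sums (parent_id);
""")

# The main query now joins the pre-aggregated temp table, so other joins
# can't multiply the child rows being summed.
rows = conn.execute("""
    SELECT p.first_name, SUM(cs.kid_fingers) AS kid_fingers
    FROM parent AS p
    JOIN child_sums AS cs ON cs.parent_id = p.id
    GROUP BY p.first_name
    ORDER BY p.first_name
""").fetchall()
print(rows)  # [('Bob', 37), ('Joe', 18)]
```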