I've looked at a lot of questions trying to figure this out and finally got a query that does what I want. But I'm wondering if there's a better way of doing this? Essentially, I have a bunch of tables that get joined together due to a belongs_to relationship and I want the number of rows being returned to only be as many as the largest number of rows in one of these belongs_to tables.
Since that's kind of awkwardly worded, here's an example. Say I have 2 recipes and each have steps and some nutrition. Recipe 1 has 3 steps and 3 nutritions, and recipe 2 has 2 steps and 4 nutritions. For Recipe 1 at most 3 rows should be returned, and for recipe 2 at most 4 rows should be returned. Here's a fiddle with the data: http://sqlfiddle.com/#!9/bcce59/2
In case the fiddle doesn't work for some reason here's the table schema:
CREATE TABLE recipe
(`id` int PRIMARY KEY, `title` varchar(64))
;
CREATE TABLE step
(`rid` int, `instruction` varchar(64),
FOREIGN KEY(rid) REFERENCES recipe(id) )
;
CREATE TABLE nutrition
(`rid` int, `name` varchar(64), `amount` int,
FOREIGN KEY(rid) REFERENCES recipe(id) )
;
And here's some sample data:
INSERT INTO recipe
(`id`, `title`)
VALUES
(1, 'Cookies'),
(2, 'Bananas')
;
INSERt INTO step
(`rid`, `instruction`)
VALUES
(1, 'Unwrap'),
(1, 'Dip in milk'),
(1, 'Eat'),
(2, 'Peal'),
(2, 'Eat')
;
INSERT INTO nutrition
(`rid`, `name`, `amount`)
VALUES
(1, 'calories', 120),
(1, 'sugar', 300),
(1, 'fat', 50),
(2, 'calories', 50),
(2, 'sugar', 50),
(2, 'fat', 20),
(2, 'carb', 30)
;
Now, I thought I might be able to do this with a group by at first. But things like
SELECT id, title, instruction, name, amount FROM
recipe
LEFT JOIN step ON recipe.id = step.rid
LEFT JOIN nutrition on recipe.id = nutrition.rid
GROUP BY id, instruction, name, amount;
Will return 17 rows since it's a product and the number of unique pairings of the group by columns is 9 for recipe 1 and 8 for recipe 2. So that's out. After a lot of searching amongst the tags and pouring over MySQL documentation and a cookbook book I have, I came up with the following query which does do the job:
SELECT id, title, instruction, name, amount FROM
(
SELECT
id,
title,
instruction,
name,
amount
FROM recipe
LEFT JOIN step ON recipe.id = step.rid
LEFT JOIN nutrition on recipe.id = nutrition.rid
) data
INNER JOIN
(
SELECT
s.rid,
CASE
WHEN
GREATEST(numSteps, numNutrition) = numSteps
THEN instruction
WHEN
GREATEST(numSteps, numNutrition) = numNutrition
THEN name
END as row
FROM
(
SELECT
rid,
instruction
FROM step GROUP BY rid, instruction
) s
LEFT JOIN
(
SELECT
rid,
name
FROM nutrition GROUP BY rid, name
) n
ON s.rid = n.rid
LEFT JOIN
(
SELECT rid, COUNT(*) as numNutrition
FROM nutrition GROUP BY rid
) nSum
ON n.rid = nSum.rid
LEFT JOIN
(
SELECT rid, COUNT(*) as numSteps
FROM step GROUP BY rid
) sSum
ON s.rid = sSum.rid
GROUP by rid, row
) biggest
ON data.id = biggest.rid
GROUP BY data.id, biggest.row
;
However, generalizing my baby example of 2 belongs_to tables to my actual database which has more than 20 tables to join makes me concerned. My real data has between 15k - 90k rows per 'recipe' when using the naive joining method, so I'm concerned about both the performance of the query and that I might just be missing something very basic and simple that will help solve this problem. I don't really want to write a stored procedure to do this, though I do wonder if a view table would make sense? My question is
- Is there a way to write the above query in a better/more performant way?
- Would it make sense to construct a view table or something like that to effectively cache the results of a possibly long and painful query?
Apologies for the odd question title, I'm not sure how to succinctly phrase what I'm doing with this query.
I realized my fiddle isn't producing the right data, so here's an edit to make it clear what the final result set of a query should be:
+----+---------+-------------+----------+--------+
| id | title | instruction | name | amount |
+----+---------+-------------+----------+--------+
| 1 | Cookies | Unwrap | calories | 120 |
| 1 | Cookies | Dip in milk | sugar | 300 |
| 1 | Cookies | Eat | fat | 50 |
| 2 | Bananas | Peel | calories | 50 |
| 2 | Bananas | Peel | sugar | 50 |
| 2 | Bananas | Eat | fat | 20 |
| 2 | Bananas | Eat | carb | 30 |
+----+---------+-------------+----------+--------+
7 rows in set (0.00 sec)
Something like this, with each instruction/nutrition value appearing at least once in the result set. With duplicates allowed for those columns which do not have the largest number of rows compared to the other belongs_to tables.