1

I've looked at a lot of questions trying to figure this out and finally got a query that does what I want. But I'm wondering if there's a better way of doing this? Essentially, I have a bunch of tables that get joined together due to a belongs_to relationship and I want the number of rows being returned to only be as many as the largest number of rows in one of these belongs_to tables.

Since that's kind of awkwardly worded, here's an example. Say I have 2 recipes and each have steps and some nutrition. Recipe 1 has 3 steps and 3 nutritions, and recipe 2 has 2 steps and 4 nutritions. For Recipe 1 at most 3 rows should be returned, and for recipe 2 at most 4 rows should be returned. Here's a fiddle with the data: http://sqlfiddle.com/#!9/bcce59/2

In case the fiddle doesn't work for some reason here's the table schema:

CREATE TABLE recipe
    (`id` int PRIMARY KEY, `title` varchar(64))
;

CREATE TABLE step
    (`rid` int, `instruction` varchar(64),
    FOREIGN KEY(rid) REFERENCES recipe(id) )
;

CREATE TABLE nutrition
    (`rid` int, `name` varchar(64), `amount` int,
    FOREIGN KEY(rid) REFERENCES recipe(id) )
;

And here's some sample data:

INSERT INTO recipe
    (`id`, `title`)
VALUES
    (1, 'Cookies'),
    (2, 'Bananas')
;

INSERt INTO step
    (`rid`, `instruction`)
VALUES
    (1, 'Unwrap'),
    (1, 'Dip in milk'),
    (1, 'Eat'),
    (2, 'Peal'),
    (2, 'Eat')
;

INSERT INTO nutrition
    (`rid`, `name`, `amount`)
VALUES
    (1, 'calories', 120),
    (1, 'sugar', 300),
    (1, 'fat', 50),
    (2, 'calories', 50),
    (2, 'sugar', 50),
    (2, 'fat', 20),
    (2, 'carb', 30)
;

Now, I thought I might be able to do this with a group by at first. But things like

SELECT id, title, instruction, name, amount FROM
recipe 
LEFT JOIN step ON recipe.id = step.rid
LEFT JOIN nutrition on recipe.id = nutrition.rid
GROUP BY id, instruction, name, amount;

Will return 17 rows since it's a product and the number of unique pairings of the group by columns is 9 for recipe 1 and 8 for recipe 2. So that's out. After a lot of searching amongst the tags and pouring over MySQL documentation and a cookbook book I have, I came up with the following query which does do the job:

SELECT id, title, instruction, name, amount FROM 
(
    SELECT 
        id, 
        title, 
        instruction, 
        name, 
        amount
    FROM recipe 
        LEFT JOIN step ON recipe.id = step.rid
        LEFT JOIN nutrition on recipe.id = nutrition.rid
) data
INNER JOIN 
(
    SELECT 
        s.rid,
        CASE
            WHEN
                GREATEST(numSteps, numNutrition) = numSteps
            THEN instruction
            WHEN
                GREATEST(numSteps, numNutrition) = numNutrition
            THEN name
        END as row
    FROM 
    (
        SELECT
            rid,
            instruction
        FROM step GROUP BY rid, instruction 
    ) s
    LEFT JOIN
        (
            SELECT
                rid,
                name
            FROM nutrition GROUP BY rid, name
        ) n
    ON s.rid = n.rid
    LEFT JOIN
        (
            SELECT rid, COUNT(*) as numNutrition 
            FROM nutrition GROUP BY rid
        ) nSum
    ON n.rid = nSum.rid
    LEFT JOIN
        (
            SELECT rid, COUNT(*) as numSteps 
            FROM step GROUP BY rid
        ) sSum
    ON s.rid = sSum.rid
    GROUP by rid, row
) biggest 
ON data.id = biggest.rid 
GROUP BY data.id, biggest.row
;

However, generalizing my baby example of 2 belongs_to tables to my actual database which has more than 20 tables to join makes me concerned. My real data has between 15k - 90k rows per 'recipe' when using the naive joining method, so I'm concerned about both the performance of the query and that I might just be missing something very basic and simple that will help solve this problem. I don't really want to write a stored procedure to do this, though I do wonder if a view table would make sense? My question is

  • Is there a way to write the above query in a better/more performant way?
  • Would it make sense to construct a view table or something like that to effectively cache the results of a possibly long and painful query?

Apologies for the odd question title, I'm not sure how to succinctly phrase what I'm doing with this query.

I realized my fiddle isn't producing the right data, so here's an edit to make it clear what the final result set of a query should be:

+----+---------+-------------+----------+--------+
| id | title   | instruction | name     | amount |
+----+---------+-------------+----------+--------+
|  1 | Cookies | Unwrap      | calories |    120 |
|  1 | Cookies | Dip in milk | sugar    |    300 |
|  1 | Cookies | Eat         | fat      |     50 |
|  2 | Bananas | Peel        | calories |     50 |
|  2 | Bananas | Peel        | sugar    |     50 |
|  2 | Bananas | Eat         | fat      |     20 |
|  2 | Bananas | Eat         | carb     |     30 |
+----+---------+-------------+----------+--------+
7 rows in set (0.00 sec)

Something like this, with each instruction/nutrition value appearing at least once in the result set. With duplicates allowed for those columns which do not have the largest number of rows compared to the other belongs_to tables.

EdgeCaseBerg
  • 2,761
  • 1
  • 23
  • 39
  • 1
    Use MySQL to get the data. Use some other language to format it. – Paul Spiegel Aug 11 '17 at 16:40
  • 1
    This is my kind of cooking. Just about my level. But I'd 'peel' the banana. – Strawberry Aug 11 '17 at 16:41
  • @PaulSpiegel Unfortunately, trying to process this many rows at the application layer is causing issues which is why we're going to the MySQL layer to see if we can fix it there. For 20 items we get 400k+ rows, which cause OOM errors if we let mySQL buffer, but if we switch to streaming with fetch size = min then it takes too long to send the data back to the application and connections time out. – EdgeCaseBerg Aug 11 '17 at 16:41
  • Seems that you are doing something wrong in your application layer. But that is no reason to do it (wrong) in SQL. What is your expected result from the sample data? And what is your application language? – Paul Spiegel Aug 11 '17 at 16:48
  • @PaulSpiegel The fiddle is close, returning 3 rows for one recipe and 4 rows for the other. Though I just realized that the instruction/nutrition in the result set is incorrect and being grouped improperly so I'm going to look at that again. I've also considered bumping the fetch size up and using the server side cursor option so that it's still streaming but with larger sets. But If I can cut down the amount of semi-duplicative data sent to me in the first place it will help tons – EdgeCaseBerg Aug 11 '17 at 16:53
  • You are only telling how many rows should be returned. You are not telling what the rows (and columns) should contain. I could give you a query, that would return the correct number of rows. But it wouldn't solve your problem. So again: what is your expected result (as a table)? – Paul Spiegel Aug 11 '17 at 17:05
  • @PaulSpiegel edited the question with an example result set of what I'm trying to get the fiddle to do. – EdgeCaseBerg Aug 11 '17 at 17:11
  • As far as I anderstand your issue, you just need [three simple queries](http://sqlfiddle.com/#!9/bcce59/13). You can then combine the data in your application layer (and format in your view layer). – Paul Spiegel Aug 11 '17 at 17:22
  • Thanks @PaulSpiegel. We've been considering that as a possible solution but were concerned about network performance between the database and the app server. Still, looking over answers in https://stackoverflow.com/a/5876241/1808164 would seem to suggest that our 20 something tables may be putting us into the no-join territory and we should use separate queries. We did some benchmarking before that suggested the join was faster but that may have been a bad benchmark since it wasn't ran on the servers but on local machines... – EdgeCaseBerg Aug 11 '17 at 17:35
  • 1
    Note: This is not because JOINs are bad. This is because joins like that don't make a logical sence. With `LEFT JOIN step ON recipe.id = step.rid LEFT JOIN nutrition on recipe.id = nutrition.rid` you are creating a CROSS JOIN within the group. You could use subqueries and `group_concat` like [here](http://rextester.com/CQFI68736). But you would get a comlex code and put your output format in SQL. I would try to avoid both. – Paul Spiegel Aug 11 '17 at 17:57

0 Answers0