0

I have the following tables (minified for the sake of simplicity):

CREATE TABLE IF NOT EXISTS `product_bundles` (
  bundle_id int AUTO_INCREMENT PRIMARY KEY,
  -- More columns here for bundle attributes
) ENGINE=InnoDB;

CREATE TABLE IF NOT EXISTS `product_bundle_parts` (
  `part_id` int AUTO_INCREMENT PRIMARY KEY,
  `bundle_id` int NOT NULL,
  `sku` varchar(255) NOT NULL,
  -- More columns here for product attributes
  KEY `bundle_id` (`bundle_id`),
  KEY `sku` (`sku`)
) ENGINE=InnoDB;

CREATE TABLE IF NOT EXISTS `products` (
  `product_id` mediumint(8) AUTO_INCREMENT PRIMARY KEY,
  `sku` varchar(64) NOT NULL DEFAULT '',
  `status` char(1) NOT NULL default 'A',
  -- More columns here for product attributes
  KEY (`sku`),
) ENGINE=InnoDB;

And I want to show only the 'product bundles' that are currently completely in stock and defined in the database (since these get retrieved from a third party vendor, there is no guarantee the SKU is defined). So I figured I'd need an anti-join to retrieve it accordingly:

SELECT SQL_CALC_FOUND_ROWS *
FROM product_bundles AS bundles
WHERE 1
  AND NOT EXISTS (
    SELECT *
    FROM product_bundle_parts AS parts
    LEFT JOIN products AS products ON parts.sku = products.sku
    WHERE parts.bundle_id = bundles.bundle_id
      AND products.status = 'A'
      AND products.product_id IS NULL
  )
-- placeholder for other dynamic conditions for e.g. sorting
LIMIT 0, 24

Now, I sincerely thought this would filter out the products by status, however, that seems not to be the case. I then changed one thing up a bit, and the query never finished (although I believe it to be correct):

SELECT SQL_CALC_FOUND_ROWS *
FROM product_bundles AS bundles
WHERE 1
  AND NOT EXISTS (
    SELECT *
    FROM product_bundle_parts AS parts
    LEFT JOIN products AS products ON parts.sku = products.sku
      AND products.status = 'A'
    WHERE parts.bundle_id = bundles.bundle_id
      AND products.product_id IS NULL
  )
-- placeholder for other dynamic conditions for e.g. sorting
LIMIT 0, 24

Example data:

product_bundles
bundle_id | etc.
1         |
2         |
3         |

product_bundle_parts
part_id | bundle_id | sku
1       | 1         | 'sku11'
2       | 1         | 'sku22'
3       | 1         | 'sku33'
4       | 1         | 'sku44'
5       | 2         | 'sku55'
6       | 2         | 'sku66'
7       | 3         | 'sku77'
8       | 3         | 'sku88'

products
product_id | sku     | status
101        | 'sku11' | 'A'
102        | 'sku22' | 'A'
103        | 'sku33' | 'A'
104        | 'sku44' | 'A'
105        | 'sku55' | 'D'
106        | 'sku66' | 'A'
107        | 'sku77' | 'A'
108        | 'sku99' | 'A'

Example result: Since the product status of product #105 is 'D' and 'sku88' from part #8 was not found:

bundle_id | etc.
1         |

I am running Server version: 10.3.25-MariaDB-0ubuntu0.20.04.1 Ubuntu 20.04

So there are a few questions I have.

  1. Why does the first query not filter out products that do not have the status A.
  2. Why does the second query not finish?
  3. Are there alternative ways of achieving the same thing in a more efficient matter, as this looks rather cumbersome.
Harm Smits
  • 437
  • 3
  • 10
  • Which is the meaning of the "A" status? Does it mean that the SKU is defined? – Ma3x Apr 09 '21 at 13:21
  • @FrancescoMunao' Its an enumerable. It simply means 'Active' in this case. We have a generic 'ObjectStatuses' enumerable that defines a variety of other ones as well. The product is guaranteed to be in stock at that point, so no other checks necessary (this is done for concurrency). – Harm Smits Apr 09 '21 at 13:24

1 Answers1

1

First of all, I've read that SQL_CALC_FOUND_ROWS * is much slower than running two separate query (COUNT(*) and then SELECT * or, if you make your query inside another programming language, like PHP, executing the SELECT * and then count the number of rows of the result set)

Second: your first query returns all the boundles that doesn't have ANY active products, while you need the boundles with ALL products active. I'd change it in the following:

    SELECT SQL_CALC_FOUND_ROWS *
    FROM product_bundles AS bundles
    WHERE NOT EXISTS (
        SELECT 'x'
        FROM product_bundle_parts AS parts
            LEFT JOIN products ON (parts.sku = products.sku)
        WHERE parts.bundle_id = bundles.bundle_id
        AND COALESCE(products.status, 'X') != 'A'
    )
-- placeholder for other dynamic conditions for e.g. sorting
LIMIT 0, 24

I changed the products.status = 'A' in products.status != 'A': in this way the query will return all the boundles that DOESN'T have inactive products (I also removed the condition AND products.product_id IS NULL because it should have been in OR, but with a loss in performance). You can see my solution in SQLFiddle.

Finally, to know why your second query doesn't end, you should check the structure of your tables and how they are indexed. Executing an Explain on the query could help you to find eventual issues on the structure. Just put the keyword EXPLAIN before the SELECT and you'll have your "report" (EXPLAIN SELECT * ....).

Ma3x
  • 516
  • 6
  • 19
  • Thanks for your reply, however this still does not do what I originally meant, please take a look at the added example data. Also, `SQL_CALC_ROWS` is faster for me, than using a `select count()`. – Harm Smits Apr 09 '21 at 14:47
  • @codam_hsmits, I've edited the answer, also with a SQLFiddle example. Now the query give as result only bundle #1, as `Sku55` from bundle #2 is not 'A' and `Sku88` in bundle 3 is not defined in the database. Let me know if this helps. :) – Ma3x Apr 09 '21 at 15:15
  • That does work, thanks! However, the performance is still pretty bad. I suppose I should just check if the products exist on insertion and add an extra status. By implementing a quick hotfix for this query time went down from 1.1 second, to just .004 seconds. I checked, everything was going over indices, but the query optimizer can't handle it. Any idea for perhaps another implementation that would not require me to do the aforementioned? – Harm Smits Apr 09 '21 at 17:10
  • @codam_hsmits If you execute the EXPLAIN on that query which is the part that "costs" most? Maybe the COALESCE should be the slower part, as it calls a function on each row of table `products`. But, without it, you can't check if a bundle contains a products that doesn't exists in table `products`. An alternative could be to execute 2 differente queries: the first to check if all products of a bundle exists and the second to check the flag "A" – Ma3x Apr 13 '21 at 09:08