
I am working with eBay categories. I'm looking for the most efficient way to retrieve a list of matching "leaf categories" (categories with no children, i.e. the deepest level only) together with their complete breadcrumbs, given a term that matches part of a category name.

Here is a SQL Fiddle I've been working with.

Assuming I only had two leaf categories (Post-1900 and Pre-1900), here are their breadcrumbs:

Antiques > Antique Clocks > Bracket Clocks > Post-1900
Antiques > Antique Clocks > Bracket Clocks > Pre-1900

If the term "Bracket" is used, the results would contain two rows, one for each breadcrumb, but if "Post-19" is the term then only one row would be returned. Each row should contain two fields, CategoryID and breadcrumb, where CategoryID must be that of the "leaf category".
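To illustrate the expected output (the CategoryIDs come from the sample data below), the term "Bracket" should produce something like:

+------------+--------------------------------------------------------+
| CategoryID | breadcrumb                                             |
+------------+--------------------------------------------------------+
| 96762      | Antiques > Antique Clocks > Bracket Clocks > Post-1900 |
| 66840      | Antiques > Antique Clocks > Bracket Clocks > Pre-1900  |
+------------+--------------------------------------------------------+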

CREATE TABLE `ebay_categories` (
  `CategoryID` int(11) DEFAULT NULL,
  `CategoryName` varchar(20) DEFAULT NULL,
  `CategoryParentID` int(11) DEFAULT NULL,
  `CategoryLevel` int(11) DEFAULT NULL
);

insert into `ebay_categories` (`CategoryID`, `CategoryName`, `CategoryParentID`, `CategoryLevel`) values('20081','Antiques','20081','1');
insert into `ebay_categories` (`CategoryID`, `CategoryName`, `CategoryParentID`, `CategoryLevel`) values('13851','Antique Clocks','20081','2');
insert into `ebay_categories` (`CategoryID`, `CategoryName`, `CategoryParentID`, `CategoryLevel`) values('100904','Bracket Clocks','13851','3');
insert into `ebay_categories` (`CategoryID`, `CategoryName`, `CategoryParentID`, `CategoryLevel`) values('96762','Post-1900','100904','4');
insert into `ebay_categories` (`CategoryID`, `CategoryName`, `CategoryParentID`, `CategoryLevel`) values('66840','Pre-1900','100904','4');

I'm trying to implement the same method used here, but have been failing miserably:

SELECT LeafID as CategoryID, GROUP_CONCAT(CategoryName SEPARATOR ' > ') AS breadcrumb FROM (

    (SELECT CategoryID as LeafID AS 

       SELECT * from ebay_categories WHERE CategoryName LIKE '%Antiq%')  AS c

  ) AS b GROUP BY LeafID

) AS a ORDER BY breadcrumb ASC Limit 20

1 Answer


SQL can only ever return a fixed number of columns (without being "clever" with stored procedures etc.).

(FYI: the answer you link to is, IMHO, being "clever". It's doing some really non-obvious things with session variables etc., which, if they don't hurt performance, certainly hurt readability, so I'm going to try and answer your question from a different angle.)

You could therefore fix (hardcode) the breadcrumb "depth", and with a fixed number of JOIN statements everything becomes very simple.

I'm assuming the breadcrumb depths are intended to be anything between 1 and infinity, i.e. another "collection" of items may be filed under a shallower hierarchy of categories?

This being the case, your GROUP_CONCAT is probably one part of the solution, because it emulates "variable column counts" in SQL. (It returns a single column, but that column can contain a flexible number of delimited values.)
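As a minimal illustration against your sample data (the IN list here is just hand-picking the Bracket Clocks branch for demonstration):

SELECT GROUP_CONCAT(`CategoryName` ORDER BY `CategoryLevel` SEPARATOR ' > ') AS `breadcrumb`
FROM `ebay_categories`
WHERE `CategoryID` IN (20081, 13851, 100904, 96762);
-- -> 'Antiques > Antique Clocks > Bracket Clocks > Post-1900'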

Your problem will still be that SQL can only join one table to another (single) table per JOIN statement. Your breadcrumb data structure is well normalised, and requires one join for each sub-category-to-parent step.

You could try to dynamically build the SQL, but that's probably going to burn you. You're probably left with two "obvious" options:

  1. Stored procedure. (Go beyond basic SQL.)
  2. Change your schema. (Solve the problem when the data is stored, rather than retrieved.)

Stored procedures could solve this in a number of ways. One obvious option is to iterate through building each breadcrumb procedurally, storing each in a temporary table, then finally selecting the entire temporary table.

I'm happy to give you guidance on this but I won't do it as part of this answer (unless requested to) because I'm fairly sure the performance will be sufficiently bad that you ultimately won't want to use it.

The other "major" option then is to restructure the schema. In this scenario the level of normalisation you've achieved is making things unduly complex. It's good "academically" and it's good for disk space. But it's not lending itself to solving your problem very well!

Denormalising has one other major trade-off: you'll have more complexity when changing the data in the schema. I'd recommend starting off by writing a routine which "rebuilds" the data (if you take this approach), because otherwise things will get out of sync and you'll spend forever trying to work out what went wrong. (I speak from experience.) A sketch of such a rebuild appears at the end of this answer.

To restate the problem: for every matching record (you're comparing user input to CategoryName), you want to return, and be able to group by, everything which precedes it in the tree, all without doing "clever" stuff.

One (of several) denormalisation approaches would be to maintain a depth * width list mapping each leaf to all of its ancestors. (Like I said, it's not storage efficient; you'll have to evaluate whether that is a problem in a production scenario.) For your example data, it would look like this:

+------------+--------+
| AncestorId | LeafId |
+------------+--------+
| 20081      | 66840  |
| 20081      | 96762  |
| 13851      | 66840  |
| 13851      | 96762  |
| 100904     | 66840  |
| 100904     | 96762  |
| 66840      | 66840  |
| 96762      | 96762  |
+------------+--------+

With that in place, you can do something like this:

CREATE TABLE `tree_branches` (
  `AncestorId` int(11) NOT NULL,
  `LeafId` int(11) NOT NULL
);

INSERT INTO `tree_branches` SET `AncestorId`=20081, `LeafId`=66840;
INSERT INTO `tree_branches` SET `AncestorId`=20081, `LeafId`=96762;
INSERT INTO `tree_branches` SET `AncestorId`=13851, `LeafId`=66840;
INSERT INTO `tree_branches` SET `AncestorId`=13851, `LeafId`=96762;
INSERT INTO `tree_branches` SET `AncestorId`=100904, `LeafId`=66840;
INSERT INTO `tree_branches` SET `AncestorId`=100904, `LeafId`=96762;
INSERT INTO `tree_branches` SET `AncestorId`=66840, `LeafId`=66840;
INSERT INTO `tree_branches` SET `AncestorId`=96762, `LeafId`=96762;

SELECT
  GROUP_CONCAT(`breadCrumbCategories`.`CategoryName` SEPARATOR " > ")
FROM `ebay_categories` AS `matchedCategory`
INNER JOIN `tree_branches` AS `matchedCategoryLeaves` ON (`matchedCategoryLeaves`.`AncestorId` = `matchedCategory`.`CategoryID`)
INNER JOIN `tree_branches` AS `breadCrumbs` ON (`breadCrumbs`.`LeafId` = `matchedCategoryLeaves`.`LeafId`)
INNER JOIN `ebay_categories` AS `breadCrumbCategories` ON (`breadCrumbCategories`.`CategoryID` = `breadCrumbs`.`AncestorId`)
WHERE
  `matchedCategory`.`CategoryName` LIKE "Post%"
GROUP BY
  `breadCrumbs`.`LeafId`
  ;

You should add some sort of ordering inside the GROUP_CONCAT to ensure it doesn't concatenate the crumbs in an implicit, unexpected order. You could (for example) maintain a level column for that purpose; your schema already has CategoryLevel, as in the sketch below.
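For example, a sketch of the query above with deterministic ordering (and with the leaf's CategoryID in the select list, since the question asks for it):

SELECT
  `breadCrumbs`.`LeafId` AS `CategoryID`,
  GROUP_CONCAT(`breadCrumbCategories`.`CategoryName`
               ORDER BY `breadCrumbCategories`.`CategoryLevel`
               SEPARATOR " > ") AS `breadcrumb`
FROM `ebay_categories` AS `matchedCategory`
INNER JOIN `tree_branches` AS `matchedCategoryLeaves` ON (`matchedCategoryLeaves`.`AncestorId` = `matchedCategory`.`CategoryID`)
INNER JOIN `tree_branches` AS `breadCrumbs` ON (`breadCrumbs`.`LeafId` = `matchedCategoryLeaves`.`LeafId`)
INNER JOIN `ebay_categories` AS `breadCrumbCategories` ON (`breadCrumbCategories`.`CategoryID` = `breadCrumbs`.`AncestorId`)
WHERE
  `matchedCategory`.`CategoryName` LIKE "Post%"
GROUP BY
  `breadCrumbs`.`LeafId`
  ;
-- -> 96762 | Antiques > Antique Clocks > Bracket Clocks > Post-1900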

Update: once you have understood what I've done above, you should test it with LIKE 'Ant%' and observe the erroneous output. Add a second GROUP BY column and a DISTINCT to solve the problem caused by user queries matching multiple crumbs which are ancestors of the same leaf:

SELECT
  DISTINCT GROUP_CONCAT(`breadCrumbCategories`.`CategoryName` SEPARATOR " > ")
FROM `ebay_categories` AS `matchedCategory`
INNER JOIN `tree_branches` AS `matchedCategoryLeaves` ON (`matchedCategoryLeaves`.`AncestorId` = `matchedCategory`.`CategoryID`)
INNER JOIN `tree_branches` AS `breadCrumbs` ON (`breadCrumbs`.`LeafId` = `matchedCategoryLeaves`.`LeafId`)
INNER JOIN `ebay_categories` AS `breadCrumbCategories` ON (`breadCrumbCategories`.`CategoryID` = `breadCrumbs`.`AncestorId`)
WHERE
  `matchedCategory`.`CategoryName` LIKE "Ant%"
GROUP BY
  `breadCrumbs`.`LeafId`,
  `matchedCategory`.`CategoryID`
  ;
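Finally, a sketch of the "rebuild" routine recommended earlier, rather than maintaining tree_branches by hand. This assumes MySQL 8+ for the recursive CTE; on older versions you'd be back to a stored procedure (option 1). Note the guards around the root category, which the sample data records as its own parent (20081 -> 20081) and which would otherwise send the recursion into an infinite loop:

DELETE FROM `tree_branches`;

INSERT INTO `tree_branches` (`AncestorId`, `LeafId`)
WITH RECURSIVE `paths` (`AncestorId`, `LeafId`) AS (
  -- Seed: every leaf (a category with no children) is its own ancestor.
  SELECT `leaf`.`CategoryID`, `leaf`.`CategoryID`
  FROM `ebay_categories` AS `leaf`
  WHERE NOT EXISTS (
    SELECT 1 FROM `ebay_categories` AS `child`
    WHERE `child`.`CategoryParentID` = `leaf`.`CategoryID`
      AND `child`.`CategoryID` <> `leaf`.`CategoryID`
  )
  UNION ALL
  -- Walk upwards: record the parent of each ancestor found so far,
  -- stopping once we reach the self-parenting root.
  SELECT `cat`.`CategoryParentID`, `paths`.`LeafId`
  FROM `paths`
  INNER JOIN `ebay_categories` AS `cat` ON (`cat`.`CategoryID` = `paths`.`AncestorId`)
  WHERE `cat`.`CategoryParentID` <> `cat`.`CategoryID`
)
SELECT `AncestorId`, `LeafId` FROM `paths`;

Against the sample data, this regenerates exactly the eight rows shown in the table above.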