1

I have a table for categories with a nested set model. Each row should contain its count of sub-categories and how much articles are in those or '0' if there aren't any.

I've searched arround and found two possible solutions but nothing of them works:
MySQL & nested set: slow JOIN (not using index)
Why isn't MySQL using any of these possible keys?

Create Table categories:

CREATE TABLE `categories` (
  `GROUP_ID` varchar(255) CHARACTER SET utf8 NOT NULL,
  `GROUP_NAME` varchar(255) CHARACTER SET utf8 NOT NULL,
  `PARENT_ID` varchar(255) CHARACTER SET utf8 NOT NULL,
  `TYPE` enum('root','node','leaf') CHARACTER SET utf8 NOT NULL DEFAULT 'node',
  `LEVEL` tinyint(2) NOT NULL DEFAULT '0',
  `GROUP_ORDER` int(11) NOT NULL,
  `GROUP_DESCRIPTION` text CHARACTER SET utf8 NOT NULL,
  `total_articles` int(11) unsigned NOT NULL DEFAULT '0',
  `total_cats` int(11) unsigned NOT NULL DEFAULT '0',
  `lft` smallint(5) unsigned NOT NULL DEFAULT '0',
  `rgt` smallint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`GROUP_ID`),
  KEY `PARENT_ID` (`PARENT_ID`),
  KEY `lft` (`lft`),
  KEY `rgt` (`rgt`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

total_cats is the amount of sub-categories in the rows tree.
The following query will do exactly what i want: all sub-category and article counts. But it is very slow. It takes more than 80 seconds to perform on ~5000 categories and ~40000 articles. The calculation of total_articles is already done by another script. (If there arent any articles, all rows should hold 0 for total_articles)

The Query:

SELECT a.GROUP_ID,a.PARENT_ID,COUNT(b.GROUP_ID) as total_cats,(
   SELECT SUM(c.total_articles)
   FROM categories c
   WHERE c.PARENT_ID = a.GROUP_ID) as total_articles
FROM categories as b
   INNER JOIN categories as a
     ON a.lft < b.lft AND a.rgt > b.rgt
GROUP BY a.GROUP_ID

It results in something like this:

+-------------------------------------------+-------------------------------------+------------+----------------+
| GROUP_ID                                  | PARENT_ID                           | total_cats | total_articles |
+-------------------------------------------+-------------------------------------+------------+----------------+
| 69_69_1                                   | 69_69_0                             |       4252 |              0 |
| 69_69_Abfall__Wertstoffsammler___zubehoer | 69_69_NWEAB290h001                  |          5 |             20 |
| 69_69_Abisolierzangen                     | 69_69_NWAAA458h001                  |          4 |             56 |
| 69_69_Abzieher_2                          | 69_69_NWAAB944h001                  |         23 |            476 |
| 69_69_Abziehvorrichtung                   | 69_69_Abzieher_2                    |          3 |             18 |
| 69_69_Aexte                               | 69_69_NWEAA615h001                  |          6 |             45 |
| 69_69_Alarmgeraete_Melder                 | 69_69_Sicherungstechnik__Heimschutz |          3 |              4 |
| 69_69_Allgemeiner_Industriebedarf         | 69_69_Industrieausruestung          |          8 |             21 |
| 69_69_Allgemeines_Schweisszubehoer        | 69_69_NWEAB113h001                  |         27 |             97 |
| 69_69_Anker__Befestigungstechnik__1       | 69_69_Befestigungstechnik           |          5 |            163 |

The explain if it helps:

+----+--------------------+-------+------+---------------+-----------+---------+------+------+------------------------------------------------+
| id | select_type        | table | type | possible_keys | key       | key_len | ref  | rows | Extra                                          |
+----+--------------------+-------+------+---------------+-----------+---------+------+------+------------------------------------------------+
|  1 | PRIMARY            | b     | ALL  | lft,rgt       | NULL      | NULL    | NULL | 4253 | Using temporary; Using filesort                |
|  1 | PRIMARY            | a     | ALL  | lft,rgt       | NULL      | NULL    | NULL | 4253 | Range checked for each record (index map: 0xC) |
|  2 | DEPENDENT SUBQUERY | c     | ref  | PARENT_ID     | PARENT_ID | 767     | func |    7 | NULL                                           |
+----+--------------------+-------+------+---------------+-----------+---------+------+------+------------------------------------------------+

As you can see, it doesnt use the indexes. If i put FORCE INDEX (lft,rgt) next to the JOIN the query executes, but nothing changes. Also tried to add an index on both columns lft and right:

ALTER TABLE `categories` ADD INDEX `nestedset` (`lft`, `rgt`);

But that doesnt help at all. The query still is slow.

Interestingly: The query is pretty fast if the categories table is just filled with a small amount of rows e.g. 260. But if it reaches 1000+ it will become slower and slower.

Example data with ~4000 categories: http://pastebin.com/BsViwFM5 its a big file!
Thanks for any help and hints!

Community
  • 1
  • 1
UnskilledFreak
  • 306
  • 2
  • 13
  • maybe better asked at dba.stackexchange? – davejal Feb 21 '17 at 14:28
  • maybe u are right, but others asked with similar situations so if anyone want to migrate it, feel free to do so =) – UnskilledFreak Feb 21 '17 at 15:12
  • Incidentally, wherever the word INT appears, the number that follows it is fairly meaningless – Strawberry Feb 22 '17 at 08:29
  • so what should i use instead? Originally the lft and rgt columns was INTs, so i changed it to smallint so mysql doesnt have to use the full INT-scale – UnskilledFreak Feb 22 '17 at 08:55
  • 1
    I would stick with INT throughout - but either way, I doubt it makes much difference to performance. Your choice of a VARCHAR id (for both group and parent) perhaps hinders performance very slightly, but probably not a lot. – Strawberry Feb 22 '17 at 09:40
  • I'll give it a try, unfortunately i have to go with varchars on GROUP_ID and PARENT_ID cause of dependencies :/ Or should i add "real" IDs and just use the actual GROUP_ID as text? – UnskilledFreak Feb 22 '17 at 09:50
  • I would - but it's your design, not mine! – Strawberry Feb 23 '17 at 10:53

2 Answers2

2

What does the EXPLAIN for this look like?

SELECT a.GROUP_ID
     , a.PARENT_ID
     , COUNT(b.GROUP_ID) total_cats
     , c.total_articles
  FROM categories b
  JOIN categories a
    ON a.lft < b.lft 
   AND a.rgt > b.rgt
  JOIN 
     ( SELECT parent_id 
           , SUM(total_articles) total_articles
        FROM categories 
       GROUP 
          BY parent_id
     ) c
    ON c.parent_id = a.GROUP_ID
 GROUP 
    BY a.GROUP_ID
Strawberry
  • 33,750
  • 13
  • 40
  • 57
  • Your query is faster than mine but also runs for about 10 seconds. You can find the explain here: http://pastebin.com/ZH7cTCGn – UnskilledFreak Feb 22 '17 at 08:18
  • I guess I'd try with a covering index on lft,rgt . I can't see that there's that much else you can do - but others might be able to help further – Strawberry Feb 22 '17 at 08:25
  • tested with adding an 2 column index nestedset like above. The explain still doesnt show to use this new key but the query was 1 second faster. Anyways thats a good improvement 80 > 10! – UnskilledFreak Feb 22 '17 at 08:40
  • since there arnt any other solutions to "fix" this, yours is the best and provided an improvement =) Thanks! – UnskilledFreak Feb 23 '17 at 08:30
  • @UnskilledFreak Did you remove the separate index on lft? I cannot understand why the query isn't using the lft-rgt index. – Strawberry Feb 23 '17 at 09:32
  • I tried it right now an removed the "standalone" index on lft and rgt. It just does not use the lft-rgt index, However without those two single indexes the query was even faster, it took just 5 sec, Great! – UnskilledFreak Feb 23 '17 at 10:36
  • That sounds like caching to me. Asking a fresh question (referring back to this one) about why my query doesn't use that index may elicit further response. – Strawberry Feb 23 '17 at 10:48
0

The right-left tree is a cute "textbook" technique. But, as you are finding out, it does not scale for the 'real world'.

The EXPLAIN show that it scanning all of b, then for each such row, it is scanning all of a. That's Order(N^2) -- 5000*5000 = 25 million operations.

Actually, this relatively new operation (Range checked for each record (index map: 0xC)) implies that it is not quite that bad.

The optimizer really can't do much better at finding 'betweenness' because one bit of missing info: whether the ranges are overlapping.

Your task may be better achieved by switching to a hierarchical schema, and "walking" the tree, either in app code, or in a Stored Routine.

With MariaDB 10.2 or MySQL 8.0, you can write a "Recursive CTE" do walk the tree in a single, though complicated, query.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • yeah i thought that its main "problem" is the betweenness. I already had a recursive function tested, done in php (parent->groups etc) but that was even slower. Maybe u could provide an example query please? – UnskilledFreak Feb 22 '17 at 08:23
  • 1
    How could the ranges overlap !?!? – Strawberry Feb 22 '17 at 08:26