Single table or Multiple tables for hierarchy data

Question

I must implement this following hierarchy data:

Category (id, name, url)
SubCategory (id, name, url)
SubSubCategory (id, name, url)

NOTE that this is many-to-many relationship. EG: Each node can have multiple parents or children. There will be no circulation relationship (thank GOD). Only some SubSubCategory may belong to multiple SubCategory.

My implementation: I use single table for this

Cat (id, type(category, subcategory, subsubcategory), name, url)
CatRelation (id, parent_id, child_id, pre_calculated_index for tree retrieval)

pre_calculated_index can be left right implementation of modified preorder tree traversal [1, 2] or a path in my implementation. This pre_calculated_index is calculated when adding child to one node so that when you retrieve a tree you only need to sort by this field and avoid recursive query.

Anyway my boss argued that this implementation is not ideal. He suggests having each table for each type of category, and then have a pivot tables to link them:

Category (id, name, url)
SubCategory (id, name, url)
SubSubCategory (id, name, url)
Category_SubCategory(category_id, sub_category_id)
SubCategory_SubSubCategory(sub_category_id, sub_sub_category_id)

When you retrieve a tree, you only need to join all tables. His arguments is that later when you add some attribute to any category type you don't need and null field in single table implementation. And the pre_calculated_index may get wrong since it is calculated in code.

Which one should I follow? Which has better performance?

I use django and postgreSQL.

PS: More detail on my pre_calculated_index implementation: Instead of left and right for each node I add a path (string, unique, indexed) value to the CatRelation: root node will have `path = '.' child node when added to CatRelation will have path = parent_path + '.' So when you sort by this path, you get everything in tree order. Examples:

Cat

| id | name       | url |
|----|------------|-----|
| 1  | Cat1       |     |
| 2  | Subcat1    |     |
| 3  | Subcat2    |     |
| 4  | Subcat3    |     |
| 5  | Subsubcat1 |     |
| 6  | Subsubcat2 |     |
| 7  | Subsubcat3 |     |

CatRelationship                                        Left right equivalent

| id    | parent_id     | child_id  | path      |           |lft |rght|            
|----   |-----------    |---------- |--------   |           |----|----|
| 1     | null          | 1         | 1.        |           | 1  | 14 |
| 2     | 1             | 2         | 1.2.      |           | 2  | 3  |
| 3     | 1             | 3         | 1.3.      |           | 4  | 11 |
| 4     | 1             | 4         | 1.4.      |           | 12 | 13 |
| 5     | 3             | 5         | 1.3.5.    |           | 5  | 6  |
| 6     | 3             | 6         | 1.3.6.    |           | 7  | 8  |
| 7     | 3             | 7         | 1.3.7.    |           | 9  | 10 |

So when you sort by path (or order by left in modified preorder tree), you will got this nice tree structure without recursion:

| id    | parent_id     | child_id  | path      |
|----   |-----------    |---------- |--------   |
| 1     | null          | 1         | 1.        |
| 2     | 1             | 2         | 1.2.      |
| 3     | 1             | 3         | 1.3.      |
| 5     | 3             | 5         | 1.3.5.    |
| 6     | 3             | 6         | 1.3.6.    |
| 7     | 3             | 7         | 1.3.7.    |
| 4     | 1             | 4         | 1.4.      |

And I can always build path dynamically using recursion:

WITH RECURSIVE CTE AS (
    SELECT R1.*, CONCAT(R1.id, ".") AS dynamic_path 
    FROM CatRelation AS R1
    WHERE R1.child_id = request_id
UNION ALL
    SELECT R2.*, CONCAT(dynamic_path, R2.child_id, ".") AS dynamic_path
    FROM CTE 
    INNER JOIN CatRelation AS R2 ON (CTE.child_id = R2.parent_id)      
)
SELECT * FROM CTE;

This is not inheritance as someone suggested

Performance depends a lot on the use case, like if you want to optimize for querying or insertion and the queries you would like to use... Did you have a look at django-mptt and django-treebeard as they offer ready-made well working implementations of modified pre-order traversal and materialized path? — Bernhard Vallant, Mar 08 '19 at 12:37
@BernhardVallant Thank you, the problem is the requirement is having multiple parent (so it look like a graph, not tree). I alredy use django-mptt before — James, Mar 08 '19 at 21:10
Please don't use `Cat` and `CatRelationship`. Cat is [cat](https://en.wikipedia.org/wiki/Cat) and category can be a [category](https://en.wikipedia.org/wiki/Category). There is really no single reason to abbreviate. — cezar, Mar 11 '19 at 11:50
Possible duplicate of [How can you represent inheritance in a database?](https://stackoverflow.com/questions/3579079/how-can-you-represent-inheritance-in-a-database) — philipxy, Sep 30 '19 at 08:59
This is a faq. Before considering posting please always google your error message or many clear, concise & precise phrasings of your question/problem/goal, with & without your particular strings/names & site:stackoverflow.com & tags, & read many answers. If you post a question, use one phrasing as title. See [ask] & the voting arrow mouseover texts. We cannot reason, communicate or search unless we make the effort to (re-re-re-)write clearly. — philipxy, Sep 30 '19 at 09:00
There's no such thing as "better"/"best" in engineering unless *you* define it. Also unfortunately all reasonable practical definitions require a ridiculous amount of experience with a ridiculous number of factors that interact with chaotic sensitivity to details. Make straightforward designs. When you demonstrate via measurement that a design and all alternatives you can think of have problems (whatever that means at the time), then ask a very specific question. Which should also define "better"/"best". [Strategy for “Which is better” questions](https://meta.stackexchange.com/q/204461) — philipxy, Sep 30 '19 at 09:01
@philipxy agree there is no better/best but this is not a duplicate question — James, Sep 30 '19 at 18:15

cezar · Accepted Answer · 2019-09-29T19:05:40.177

Your question is somewhat opinionated because you ask for a comparison of two different approaches. I'll try to provide an answer although I'm afraid there is no unique true answer to it. In the rest of the answer I'll refer to your approach as solution A and to the approach suggested by your boss as solution B.

I would strongly suggest to follow the approach proposed by your boss:

because he's your boss! If something goes wrong later, nobody can blame you. You have followed the instructions.
because it follows the "The Zen of Python".

In particular the following rules of The Zen of Python apply:

Explicit is better than implicit.
The solution B is very explicit. The solution A is implicit.
Simple is better than complex.
The solution B is very simple and straightforward. The solution A is complex.
Sparse is better than dense.
The solution B is sparse. The solution A is dense and hides the obvious from the user.
Readability counts.
The solution B is very verbose, yet easy to read. The solution A requires more time and effort to understand.

You might measure performance in ms, your boss eventually thinks about performance in $. Getting a junior developer on board would require far less time with solution B. Time is expensive for enterprises.

Future changes in the models can be easier implemented. What if you'd like to add another field to Category which shouldn't (or doesn't need) to be present in SubCategory and SubSubCategory?

Testing (unit and functional) is much easier with solution B. It would require eventually more lines of code and be more verbose, but would be easier to read and understand.

The performance will vary and depend on the use case. How many records you'll have in the database? What's more critical: retrieving or inserting/updating? What makes the earlier more performant, might deteriorate the latter and vice versa.

I hope you have heard the sentence:

Premature optimization is the root of all evil.

given by Donald Knuth.

You'll take care about performance when there are concrete issues regarding it. It doesn't mean you shouldn't not invest any forethought concerning performance when designing your application.

You can cache queries, an option would be to use redis. Since you use PostgreSQL you could also use materialized views. But as I said I'd cross that bridge when I come to it.

EDIT: You didn't mention anything else about any further models. I'd assume that when you have categories you'll have some entities, let's say products classified in those categories i.e. categorized. Here I'd give an example:

Category: Men
SubCategory: Sportswear
SubSubCategory: Running Shoes
Product: ACME speeedVX13 (fictive brand and model)

If you strictly follow this hieararchy and put a product only and only in SubSubCategory then the solution B is better.

But if you have a fictive product Sportskit ACME (running shoes, shorts and sleeveless shirt) that you can't put in SubSubCategory and need to put in SubCategory, skipping one level, then you might end up using something like generic relations.
In that case solution A is better.

Single table or Multiple tables for hierarchy data

1 Answers1