
Since I'm using PostgreSQL, there is a module called ltree which satisfies at least one of my needs: performance (I don't know about scalability; some say materialized path trees don't scale well...).
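For context, a minimal ltree setup looks roughly like this (table and column names are hypothetical; the move operation shows why reordering is the sore spot):

```sql
-- Hypothetical schema sketch using the ltree extension
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE nodes (
    id    serial PRIMARY KEY,
    path  ltree NOT NULL
);

-- GiST index to make ancestor/descendant queries fast
CREATE INDEX nodes_path_gist ON nodes USING GIST (path);

-- All descendants of 'Top.Articles' (subtree query)
SELECT * FROM nodes WHERE path <@ 'Top.Articles';

-- Moving a subtree means rewriting every descendant's path
UPDATE nodes
SET path = 'Top.Archive' || subpath(path, nlevel('Top.Articles') - 1)
WHERE path <@ 'Top.Articles';
```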

Since the application I'm developing is a CMS built entirely around a big tree (nodes, subtrees, etc.), performance in querying these nodes is absolutely essential. But since it's a large (as it grows) hierarchical tree that you're working on and manipulating from the GUI (CRUD), I also want to make it possible for users to drag and drop to reorder nodes, subtrees, etc. while correctly updating the tree (child records) in the database.

As I understand it, moving and reordering nodes/subtrees in a tree is not really what ltree/materialized path trees are good at. So what I hope you can help me with is to either point me to the tree structure model that is best for performance AND for moving subtrees and nodes, or perhaps, if ltree is indeed not a leftover from the past but still worth using, to explain how you could achieve this with PostgreSQL's ltree module. And why/why not use ltree in this case?

Requirements:

  1. Query performance is of course my top priority (all nodes, subtrees, leaves).
  2. The tree should support deep nesting and sorting.
  3. And of course the tree should have support for growing large and scaling well.
  4. I can live with a little waiting time while reordering from the GUI, if one "jack-of-all-trades" tree implementation doesn't exist, or is too complex to be worth it.

I'm also considering Closure tables aka Bridge tables (a lot!), Nested Intervals (I'm not sure I understand exactly how to implement it, and no good examples or gists currently exist?) or B-tree models. I'm just not quite sure yet how these will satisfy my four requirements above. Re-organizing subtrees and nodes in nested intervals seems straightforward, and performance seems good. It's quite hard to choose the right one to go with.

Since I definitely need performance (query/read performance), scalability and sorting, I kinda thought that Closure tables WITH a sort order could come very close. But I just can't imagine how big the closure tables and the disk-space overhead will become as my tree and its nodes grow large. Closure tables and scalability is what I'm just not too sure of. Am I wrong to worry about this, and what might the best solution for this task be?
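For reference, a closure table stores one row per (ancestor, descendant) pair, which is exactly where the storage overhead I'm worried about comes from (schema is a sketch with illustrative names):

```sql
-- Sketch of a closure table; names are illustrative
CREATE TABLE nodes (
    id     serial PRIMARY KEY,
    title  text NOT NULL
);

CREATE TABLE node_paths (
    ancestor   integer NOT NULL REFERENCES nodes (id),
    descendant integer NOT NULL REFERENCES nodes (id),
    depth      integer NOT NULL,          -- 0 = self-reference row
    PRIMARY KEY (ancestor, descendant)
);

-- A tree of N nodes at average depth d stores roughly N * d rows here,
-- but fetching a subtree is a single indexed join:
SELECT n.*
FROM nodes n
JOIN node_paths p ON p.descendant = n.id
WHERE p.ancestor = 1;                     -- subtree rooted at node 1
```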

Dac0d3r

1 Answer


The typical data structures used to index trees stored in SQL are designed and optimized for read performance on sets that don't change often.

As an example, if you're using the nested set model, adding or deleting a node would involve updating the entire tree (which typically means rewriting the entire table): great for reads, not so great for writes.

When write performance is important for you, you'll usually be better off working on the raw (id, parent_id) tuples with recursive queries, while setting tree indexes you know for sure are dirty to null as you go. In those areas of the app where read-performance is more important, do a sanity check by checking for null values in the tree index, and re-index the tree as needed before actually using it. That way, you'll avoid incessant rewrites of your tree, and instead re-index it only when needed for a read.
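A minimal sketch of that adjacency-list approach, with a nullable tree index alongside the raw tuples (table and column names are assumptions, not the asker's schema):

```sql
-- Raw (id, parent_id) adjacency list, plus a nullable nested-set index
CREATE TABLE nodes (
    id        serial PRIMARY KEY,
    parent_id integer REFERENCES nodes (id),
    lft       integer,   -- NULL = dirty, needs re-indexing
    rgt       integer
);

-- Fetch the subtree under node 1 straight from the raw tuples
WITH RECURSIVE subtree AS (
    SELECT id, parent_id, 0 AS depth
    FROM nodes
    WHERE id = 1
    UNION ALL
    SELECT n.id, n.parent_id, s.depth + 1
    FROM nodes n
    JOIN subtree s ON n.parent_id = s.id
)
SELECT * FROM subtree;

-- On write, mark the affected index entries dirty instead of rewriting the tree
UPDATE nodes SET lft = NULL, rgt = NULL WHERE id = 42;
```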

An alternative, albeit (much) more difficult, approach is to use a variation of e.g. nested sets or nested intervals, but using reals or floats instead of integers. This allows you to insert, move and delete nodes for free, at the cost of some storage and arithmetic/read overhead, and the loss of some properties such as child node counts in the case of nested sets. However, it also requires that you keep an eye out for pathological edge cases: you'll need to periodically -- and sometimes preemptively -- "garbage collect" and re-index large enough chunks of the tree's index in order to fit new nodes when you run into the floating point type's precision limits.
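For instance, with real-valued bounds a new node can be squeezed between two siblings without touching any other row (a sketch of the idea, not a complete scheme):

```sql
-- Float-based nested sets: inserting between a sibling with lft = 2.0
-- and the next with lft = 3.0 only writes the new row, not the table
INSERT INTO nodes (parent_id, lft, rgt)
VALUES (1, 2.25, 2.75);   -- bounds chosen inside the gap (2.0, 3.0)

-- The pathological case: repeatedly halving the same gap eventually
-- exhausts float precision, forcing a re-index of that chunk
```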

(A variation of the latter is to use a numeric without any precision in order to try to dodge the problem. But that's actually kicking the can down the road, in the sense that you'll still be limited by Postgres' internal limit of a few thousand digits of precision. And in my own tests from a few years back, the storage and arithmetic overheads became material compared to just using a floating point type long before running into that limit.)

As for "the best" structure or approach, there really is no magic bullet... Each has pros and cons based on the use case (frequency of reads vs writes) and the size of the set. There's plenty of literature on the web that compares and explains each of them, which I'm sure you've found already.

That being said, for a CMS I'd advise that you go with whichever method you're most comfortable with: either re-index the tree on the fly as writes occur, or mark the tree as dirty on writes and re-index it on demand. The point here is that, if re-indexing is done right (= using a plpgsql function or equivalent, rather than a gazillion queries issued by your app), re-indexing an entire tree of a few hundred thousand nodes will take a few hundred milliseconds at most. Assuming the tree isn't constantly getting updated, that's a perfectly acceptable overhead for end-users.
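Such a re-indexing pass might look roughly like this (a hypothetical sketch assuming an adjacency list with nested-set columns; the function name and recursion strategy are mine, not a standard API):

```sql
-- Hypothetical one-shot re-index: recompute nested-set bounds
-- for the whole tree inside the database, in a single call
CREATE OR REPLACE FUNCTION reindex_tree(node integer, counter integer)
RETURNS integer AS $$
DECLARE
    child integer;
BEGIN
    UPDATE nodes SET lft = counter WHERE id = node;
    counter := counter + 1;
    FOR child IN
        SELECT id FROM nodes WHERE parent_id = node ORDER BY id
    LOOP
        counter := reindex_tree(child, counter);
    END LOOP;
    UPDATE nodes SET rgt = counter WHERE id = node;
    RETURN counter + 1;
END;
$$ LANGUAGE plpgsql;

-- Re-index the whole tree rooted at node 1 in one round trip
SELECT reindex_tree(1, 1);
```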

Denis de Bernardy
  • Thanks Denis for a great answer, I did however add a little extra about "Closure Tables", if you have any comments about that approach (especially potential scalability issues). Regarding your suggestions. Read performance comes first, so I guess you're suggesting a variation of Nested Sets and Intervals, and not even considering the other tree models as options? Materialized Path, Closure Tables (really interested in your opinion on this in terms of scalability?) and other variations? Well thanks for sharing your experiences with me and for giving me some inspiration. A big decision...! – Dac0d3r Sep 17 '14 at 15:12
  • To me, this decision really reeks of premature optimization. The key is how long it takes to reindex your tree in full. A few hundred thousand nodes can be re-indexed in a second or less using any of the usual methods, and that's perfectly acceptable for the typical CMS. So pick whichever you prefer: anything goes. If you've lots more nodes in a write-heavy environment, the hybrid nested set/interval approach I described further up is IMO the better approach. And if it's trillions of rows, indexing a daily snapshot using a cron will often become the correct approach. – Denis de Bernardy Sep 17 '14 at 15:40
  • In the case of Postgres, also keep in mind that you can use [recursive queries](http://www.postgresql.org/docs/current/static/queries-with.html). I'd gather that this is in fact what you *should* be doing during your prototyping phase at very least; make the switch to an indexed tree if and only if everyday use of your CMS shows that it's not fast enough. – Denis de Bernardy Sep 17 '14 at 15:46
  • Recursive Adjacency Lists actually look to beat Nested Sets according to these tests: http://stackoverflow.com/a/4050802 - I tried to replicate the results, also trying recursive queries on nested sets, but they didn't even come close to recursive adjacency lists, in PostgreSQL at least. I'll leave this thread open for a few days, and then probably accept your answer. But as I see it, Nested Sets were relevant before recursion made it into RDBMSes - not anymore, since they're getting outperformed and the implementation is a little more complex than adjacency lists. – Dac0d3r Sep 18 '14 at 01:21
  • It's Postgres 8.4, which was released aeons ago, but I'm not entirely surprised. And per my previous comment: "Keep in mind that you can use recursive queries. I'd gather that this is in fact what you should be doing during your prototyping phase at very least; make the switch to an indexed tree if and only if everyday use of your CMS shows that it's not fast enough." – Denis de Bernardy Sep 18 '14 at 17:35