1

I have a project where I need to build and store large trees of data in Ruby. I am considering different approaches for serialization, deserialization and querying of trees, and I am wondering what would be the best way to go. My major constraints are read time, query efficiency and and cross-version/cross-platform compatibility. The most frequent operation is to retrieve sets of nodes based on a combination of id/value and/or feature(s).Trees can be up to 15-20 levels deep. Moving subtrees is an uncommon procedure, but should be possible without too much black magic. Rails integration is not a primary concern. The options I thought about, along with some issues I'm concerned about, are the following:

  • Marshal the trees, and when needed load them into memory and query them in Ruby (inefficiency as tree grows, cross-version compatibility?)
  • Same as above, but use YAML (more cross-version compatible, but less efficient?)
  • Same as above, but use a custom XML parser (need to recreate objects from scratch each time the tree is loaded?)
  • Serialize the trees to XML, store them in an XML database (e.g. Sedna) and use XPath to query the trees (no experience with this approach, not sure about efficiency?)
  • Use adjacency lists to query trees stored in an schema-less database (inefficiency when counting descendants?)
  • Use materialized paths (potential of overfilling the max string length for deep trees?)
  • Use nested sets (complex SQL queries?)
  • Use the array of ancestors approach? Seems interesting in terms of querying efficiency according to the MongoDB page, but I haven't been able to find any serious discussion of this algorithm.

Based on your experience, which approach would better fit with the constraints I have described? If I go for an XML database, are there ones that would be more suited for this project? Are there other approaches I have overlooked that would be more efficient? Thanks for your time.

user2398029
  • 6,699
  • 8
  • 48
  • 80
  • 1
    on my job we've experienced good results storing nodes as records with relevant properties as column attributes and a special prior column referring to the parent node or null if there is none. subtrees can be assembled using recursive query constructs available in several sql dialects, stored procs or self-joins if result sets are sparse and the max possible tree depth can be bound. moving subtrees means updating the prior columns of a given value. the mapping to xml reps & xpath expressions is straightforward. – collapsar Mar 05 '12 at 10:51
  • Does this have a SQL tag because you are considering storing the trees in a relational database? – Gordon Linoff May 09 '12 at 15:52
  • That's right! Are you asking because you have experience storing trees in a relational database? :) – user2398029 May 09 '12 at 16:44
  • I have experience storing a lot of varied types of data in relational databases. – Gordon Linoff May 09 '12 at 21:45
  • See [this answer](http://stackoverflow.com/a/192462/623041). – eggyal May 10 '12 at 14:45

3 Answers3

3

Trees work really well with graph databases, such as neo4j: http://neo4j.org/learn/

Neo4j is a graph database, storing data in the nodes and relationships of a graph. The most generic of data structures, a graph elegantly represents any kind of data, preserving the natural structure of the domain.

Ruby has a good interface for the trees: https://github.com/andreasronge/neo4j

Pacer is a JRuby library that enables very expressive graph traversals. Pacer allows you to create, modify and traverse graphs using very fast and memory efficient stream processing. That also means that almost all processing is done in pure Java, so when it comes the usual Ruby expressiveness vs. speed problem, you can have your cake and eat it too, it's very fast!

https://github.com/pangloss/pacer

Neography is like the neo4j.rb gem and suggested by Ron in the comments (thanks Ron!)

https://github.com/maxdemarzi/neography

joelparkerhenderson
  • 34,808
  • 19
  • 98
  • 119
  • I recently started looking into using neo4j in Ruby. At first I tried the `neo4j.rb` gem but lately I've been liking `neography` https://github.com/maxdemarzi/neography – Ron May 15 '12 at 20:27
2

Since you are considering a SQL approach, here are some things to think about.

First, how big are the trees? For many applications, 10,000 leafs would seem big. Yet this is small for a database. On any decent database system (like a laptop), you should be able to store hunreds of thousands or millions of leafs in memory.

What a database buys you over other approaches is:

-- Not having to worry about memory/disk performance. When the data spills over to disk, you don't take a big hit on performance. By comparison, consider what happens when a hash table overflows memory.

-- Being able to add indexes to optimize performance.

-- Being able to alter your access path for the tree "just" by modifying SQL

One of the problems with standard SQL is that you can represent a tree node as a simple pair: , , . Then, with a simple join, you can move between parents and leafs. However, the joins accumulate as you move up the tree.

Sigh. Different databases have different solutions for this. SQL Server has recursive CTEs, which let you traverse the tree. Oracle has another approach for tree structures.

This starts to get complicated.

Perhaps a better approach is to assign a "leaf" id based on the hierarchy in the tree. So, if this is a binary tree, then "10011" would be the node at right branch, left branch, left branch, right branch, right branch. There you would store information . . . such as whether it has children and whatever else. Getting the parent is easy, because you can just truncate the last digit.

You can see how this would generalize to non-binary trees. Having any number of children could pose a little challenge.

I believe this may be related to the "array of ancestors" approach.

As I think about it, I think this would work pretty well. I would then suggest that you define separate stored procedures for each action that you want:

usp_tree_FetchNode (NodeId) usp_tree_GetParent (NodeId) usp_tree_NodeDelete (NodeId) usp_tree_FetchSubTree (NodeId) etc. etc. etc.

Although SQL does not really support object-oriented programming, you can still organize your code with clean naming conventions and good function wrappers.

I actually think this might work and provide a pretty good method for developing the code. One nice side effect is that you can analyze the tree outside the application, which might suggest future enhancements.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Have you looked at ancestry gem? I've used it for simple trees, but by the description it looks to fit on your requirements.

Paulo Fidalgo
  • 21,709
  • 7
  • 99
  • 115