1

I have data in a Mongo database collection where each document has the id of a parent. If I want to search all documents that have a particular document (I'll refer to it as P) in its ancestry (ie P is it's parent, grandparent, great grandparent, etc), what are my options to do that efficiently, and what are those options strengths and weaknesses?

I can think of the following:

  • Store the whole ancestry in each document, so you can search for documents who's ancestry list contains P.
    • Strengths:
      • constant time look up
    • Weaknesses:
      • If a parent is changed, the corresponding update is O(n), where n is the number of descendants of the document who's parent changed
      • Some storage overhead, O(a) where a is the average depth of a document
  • When searching, first build a list of the ids of P's child documents, then grandchild documents, etc. Then search all documents with those ids
    • Strengths:
      • No change is needed to the storage structure, no additional space overhead
    • Weaknesses:
      • Building up the list of ids is an O(n) operation, where n is the number of documents descendant from P
      • Searching by possibly hundreds of ids might not be efficient

Anyone know other techniques?

B T
  • 57,525
  • 34
  • 189
  • 207
  • 1
    Have you looked into - http://docs.mongodb.org/manual/applications/data-models-tree-structures/ ? – BatScream Dec 02 '14 at 22:07
  • I haven't, but its a very relevant reference. Looks like the "array of ancestors" and "materialized paths" are essentially the same thing and are both my first option up there. Nested sets isn't an option for me, and the other two are essentially what i'm already doing – B T Dec 02 '14 at 23:40

1 Answers1

0

To normalize or not to normalize; I believe thats the NoSQL underbelly that often gets people turning back to SQL /RDBMS. I prefer not to normalize in order to provide near-real-time indexed simple queries with basic backend and frontend code. Heres some pseudocode in a related question that shows the complex code needed when normalizing. Its hard to emulate joins and relationships in NoSQL.

Granted NoSQL does open up a bonafide rats nest of issues like your "If a parent is changed, the corresponding update is O(n), where n is the number of descendants of the document who's parent changed" I call those 'relational maintenance scripts'. But I find you can run those on a scheduled (crontab) basis during off-hours. One might also strongly consider safe tables/collections and build volatile or working tables. See this question regarding OLAP tables. There you can have your relationships in nice neat tables and then create those ugly fast collections.

Its a matter of determining whether NoSQL is really for you. Even on a personal level, do you prefer faster, scalable and messy -- or slower, unscalable and neat/organized. The tradeoffs resembling the classic fast, good and cheap triangle. Basically, NoSQL is fast and cheap; SQL is good. NoSQL's good is scalability; while SQL's fast is actually respectable.

Community
  • 1
  • 1
Gabe Rainbow
  • 3,658
  • 4
  • 32
  • 42
  • Well, so actually I think you'd have exactly the same problem in relational databases. Joins or no joins, a tree is not easily represented in SQL either. Running the "relational maintenance scripts" offline is only ok if you're ok with essentially corrupted data. This doesn't work for me. I'm not very clear what you're suggesting or recommending. The links you gave seem only tenuously related to my question. – B T Dec 02 '14 at 23:56
  • 1
    im suggesting you do option 1. one can conduct a single (yet complex) sql statement for trees but with nosql that is impossible. thus either structure the data to suit a single query or create a very complex and slow map/reduce or backend. for me, its fast and ugly data and clean proprietary code. you might also look at neo4j or similar graph nosql and maybe those will offer a nice balance b/w speed and ugly data structures. – Gabe Rainbow Dec 05 '14 at 20:51
  • I guess that kind of recursive query that isn't possible in something like mysql, but is in other systems. So I'm ok with multiple queries, which makes it just as possible as it is in sql, and basically just as scalable too. To do it faster, even in SQL with recursive queries, I contend you'd have to build the same messy data. But thanks. – B T Dec 06 '14 at 01:18