6

I have a large tree that grows as my algorithm progresses. Each node contains set, which I suppose is implemented as balanced binary search tree. Each node's set shall remain fixed after that node's creation, before its use in creating that node's children.

I fear however that copying each set is prohibitively expensive. Instead, I would prefer that each newly created node's set utilize all appropriate portions of the parent node's set. In short, I'm happy copying O(log n) of the set but not O(n).

Are there any variants of the STL's associative data structures that offer such an partial copy optimization? Perhaps in Boost? Such a data structure would be trivial to implement in Haskell or OCaML of course, but it'd require more effort in C++.

rcollyer
  • 10,475
  • 4
  • 48
  • 75
Jeff Burdges
  • 4,204
  • 23
  • 46
  • What are the contents of the sets? I did not fully understand what your problem is, it seems that in each level the set is a strict subset of the parents set, is it? – David Rodríguez - dribeas Oct 04 '11 at 16:30
  • 1
    Binary search trees don't in general _have_ meaningful subtrees, unless your child node happens to correspond to a contiguous ordered subset of the parent node's set. In any case, you're already implementing your own tree structure if you have parent & child nodes, so why not just keep the data at the right level? – Useless Oct 04 '11 at 16:40
  • Also good to know, if a node contains a subset of the parents', is the subset contiguous? – Mooing Duck Oct 04 '11 at 16:53
  • Each level is a strict superset. All these sets exist simply to terminate branches of the outer tree. I've currently an unsorted manual linked list that simple retraces each tree branch, providing O(n) search for duplicates, O(1) insertion, and O(1) additional space per level. I'm willing to pay O(log n) across the board, but not higher. On average, a set insertion should modify O(log n) of the tree, leaving many subtrees invariant. An optimal implementation will frequently change less, but occasionally change much more, like Useless pointed out. Average case should be O(log n) however. – Jeff Burdges Oct 04 '11 at 17:46
  • You keep mentioning O(log(n)). Is there a reason for that value? Does each node contain only log(n) of it's parents' set? – Mooing Duck Oct 04 '11 at 18:17

3 Answers3

1

I know it's not generally productive to suggest a different language, but Haskell's standard container libraries do exactly this. I remember seeing a video (was it Simon Peyton Jones?) talking about this exact problem, and how a Haskell solution ended up being much faster than a C++ solution for the given programmer effort. Of course, this was for a specific problem that had a lot of sets with a lot of shared elements.

There is a fair amount of research into this subject. If you are looking for keywords, I suggest searching for "functional data structures" instead of "immutable data structures", since most functional paradigms benefit from immutability in general. Structures such as finger tree were developed to solve exactly this problem.

I know of no C++ library that implements these data structures. There is nothing stopping you from reading the relevant papers (or the Haskell source code, which is about 1k lines for Data.Set including tests) and implementing it yourself, but I know that is not what you'd want to hear. You'd also need do some kind of reference counting for the shared nodes, which for such deep structures can have a higher overhead than even simple garbage collectors.

Dietrich Epp
  • 205,541
  • 37
  • 345
  • 415
  • In fact, I'd selected C++ simply because I far more skilled at profiling C++ code. So yes telling me that well optimized containers existing in Haskell helps. – Jeff Burdges Oct 04 '11 at 17:29
  • 2
    @JeffBurdges: Finger Trees and semi persistent data structures should be useful to you then. – Matthieu M. Oct 04 '11 at 17:47
0

It's practically impossible in C++, since the notion of an immutable container doesn't exist. You may know that you'll be making no changes, and that some sort of shared representation would be preferable, but the compiler and the library don't, and there's no way of communicating this to them.

James Kanze
  • 150,581
  • 18
  • 184
  • 329
0

Each node contains set, which I suppose is implemented as balanced binary search tree. Each node's set shall remain fixed after that node's creation, before its use in creating that node's children.

That's a pretty unique case. I would recommend using std::vector instead. (No really!) The code is creating the node can still use a set, and switching to a vector at the last second. However, the vector is smaller, and takes only a tiny number of memory allocations (one if you use reserve), making the algorithm much faster.

typedef unsigned int treekeytype;
typedef std::vector<unsigned int> minortreetype;
typedef std::pair<treekeytype, minortreetype> majornode
typedef std::set<treekeytype, minortreetype> majortype;
majortype majortree;

void func(majortype::iterator perform) {
     std::set<unsigned int> results;
     results.assign(perform->second.begin(), perform->second.end());
     majortree[perform->first+1].assign(results.begin(), results.end()); //the only change is here
     majortype::iterator next = majortree.find(perform->first+1);
     func(next);
}

You can even use std::lower_bound and std::upper_bound to still get O(log(n)) memory accesses since it's still sorted the same as the set was, so you won't lose any efficiency. It's pure gain as long as you don't need to insert/remove frequently.

I fear however that copying each set is prohibitively expensive.

If this fear is caused because each set contains mostly the same nodes as it's parents, and the data is costly (to copy or in memory, whichever), with only a few nodes changed, make the subtrees contain std::shared_pointers instead of the data themselves. This means the data itself will not get copied, only the pointers.

I realize this isn't what you were aiming at with the question, but as JamesKanze said, I know of no such container. Other than possibly a bizarre and dangerous use of the STL's rope class. Note that I said and meant STL, not the standard C++ library. They're different.

Mooing Duck
  • 64,318
  • 19
  • 100
  • 158
  • I'm afraid it's the set size itself that the problem, well they're already set of pointers. Rope had occurred to me, but I've never used one before. Thanks, I'll check them out. – Jeff Burdges Oct 04 '11 at 17:39
  • I really just recommend the vector thing then. vectors (and queues) are _waaay_ faster than trees if you don't need frequent inserts/deletes. – Mooing Duck Oct 04 '11 at 18:15