4

I have to implement a data structure that groups the elements of a equivalence classes.

The API:

interface Grouper<T>{
  void same(T l, T r);
  Set<EquivalenceClass<T>> equivalenceClasses();
}

interface EquivalenceClass<T>{
    Set<T> members();
}

For example the grouping behaves like this:

Grouper g;
g.same(a, b);
g.equivalenceClasses() -> [[a,b]]

g.same(b, a);
g.equivalenceClasses() -> [[a,b]]

g.same(b, c);
g.equivalenceClasses() -> [[a,b,c]]

g.same(d, e);
g.equivalenceClasses() -> [[a,b,c], [d,e]]

g.same(c, d);
g.equivalenceClasses() -> [[a,b,c,d]]

I'm looking for an implementation that works up to ~10 million entries. It should be optimized to fill it and get the equivalence classes once.

Thomas Jung
  • 32,428
  • 9
  • 84
  • 114
  • 1
    [`boost::disjoint_sets`](http://www.boost.org/doc/libs/1_45_0/libs/disjoint_sets/disjoint_sets.html) comes to mind. Warning: doucmentation hard to understand. There are examples somewhere on Stack Overflow. – Sven Marnach Dec 09 '10 at 16:27
  • Here is one question dealing with `boost::disjoint_sets`: http://stackoverflow.com/questions/4134703/understanding-boostdisjoint-sets. The comments contain further pointers. `boost::disjoint_sets` is an implementation of the union-find algorithm Larry mentions. – Sven Marnach Dec 09 '10 at 17:12
  • Change your title to implementation not data structure. – Saeed Amiri Dec 09 '10 at 18:54

3 Answers3

5

Take a look at Union-Find. The union ("same") can be done trivially in O(log N), and can be done in effectively O(1) with some optimizations. The "equivalenceClasses" is O(N), which is the cost of visiting everything anyways.

Larry
  • 4,491
  • 2
  • 24
  • 16
1

If you are only going to query the equivalences classes once, the best solution is to build an undirected graph over the elements. Each equivalence is an undirected edge between the two items, and the equivalence classes correspond to the connected components. The time and space complexity will both be linear if you do it right.

Alternatively, you can use a Union-Find data structure, which will give you almost-linear time complexity. It may also be considered simpler, because all the complexities are encapsulated into the data structure. The reason Union-Find is not linear comes down to supporting efficient queries while the classes are growing.

Craig Gidney
  • 17,763
  • 5
  • 68
  • 136
0

Union-find is the best data structure for your problem, as long you only care about total running time (some operations may be slow, but the total cost of all operations is guaranteed to be nearly linear). Enumerating the members of each set is not typically supported in the plain version of union-find in textbooks though. As the name suggests, union-find typically only supports union (i.e., same) and find, which returns an identifier guaranteed to be the same as the identifier returned by a call to find on an element in the same set. If you need to enumerate the members of each set, you may have to implement it yourself so you can add, for example, child pointers so that you can traverse each tree representing a set.

If you are implementing this yourself, you don't have to implement the full union-find data structure to achieve amortized O(lg n) time per operation. Essentially, in this "light" version of union-find, each set would be a singly linked list with an extra pointer inside each node that points to a set identifier node that can be used to test whether two nodes belong to the same list. When the same method is executed, you can just append the smaller list to the larger and update the set identifiers for the elements of the smaller list. The total cost is at most O(lg n) per element because an element can be a member of the smaller list involved in a same operation at most O(lg n) times.

jonderry
  • 23,013
  • 32
  • 104
  • 171