2

Possible Duplicate:
Python: simple list merging based on intersections

I am trying to classify objects. Each object is identified by a unique identifier property called id. So my classification logic goes like this. First i prepare a list of objects and then the classification function takes 2 objects at a time and returns a frozenset containing their id. So if object1 and object5 are in the same category a frozenset(id1,id5) is returned. Now i keep adding these frozensets to a set so in the end i have a set like this

matched_set=(
             frozenset(id1,id2),
             frozenset(id9,id3),
             frozenset(id9,id2),
             frozenset(id24,id22),
             frozenset(id1,id23),
             frozenset(id25,id24),
             frozenset(id30,id24)
            )

Now because objects with id1 and id2 are in the same category, objects with id9 and id3 are in the same category, objects with id9 and id2 are in the same category, objects with id1,id2,id3,id9 should be in same category. So i should have a set like this set(id1,id2,id3,id9) Can someone provide an algorithm to do so? Thanks

Community
  • 1
  • 1
lovesh
  • 5,235
  • 9
  • 62
  • 93

1 Answers1

6

It sounds like you're looking for a disjoint-set datastructure.

Given your set of id's, your categories separate them into disjoint subsets. A disjoint-set datastructure represents each category by choosing a representative id, which will be returned by a query of any of its members. (isolated id's form one category apiece, and return themselves)

Updates to a disjoint-set datastructure combine the categories of any two id's, so that future queries return the same representative for members of both subsets. (if the two id's are already members of the same category, the update is functionally a no-op)

The usual method is to represent each category as a reverse-tree: each id has a parent link, but no child links. The "representative element" is the root of the tree, which is easy to query by following the parent links. An update requires finding the root of the trees of both id's, and (if they are different) merging the trees by making one root the parent of the other.

By adding a couple of simple optimizations (queries "collapse" the query path to point directly to the root, and updates always choose the root of the deepest tree as the merge parent), this algorithm becomes extremely efficient, running in "almost-O(1)" amortized time.

If you want online access to the complete list of id's in each category, you should maintain a cumulative list attached to each category root, and concatenate them in each merge. In general, it can be convenient to maintain any number of statistics about your categories this way.

comingstorm
  • 25,557
  • 3
  • 43
  • 67