4

While working on a feature selector for a machine learning algorithm in Python, I generated a data structure with the following code:

# Perform set partitioning on the results
groups = []
for t in results:
    (jthName,kthName) = t
    jthGroup = -1
    kthGroup = -1

    # Just a simple list of hashes with online merging
    for idx,group in enumerate(groups):
        if jthName in group:
            jthGroup = idx
        if kthName in group:
            kthGroup = idx
    if jthGroup == kthGroup:
        if jthGroup == -1: # Implicit: "and kthGroup == -1"
            groups.append(set((jthName,kthName)))
    elif jthGroup != kthGroup:
        if kthGroup == -1:
            # Merge kthName into jthGroup
            groups[jthGroup].add(kthName)
        elif jthGroup == -1:
            # Merge jthName into kthGroup (redundant if naturally-ordered)
            groups[kthGroup].add(jthName)
        else:
            # Merge jthGroup and kthGroup, since we have a connecting pair
            merged = set()
            merged.update(groups[jthGroup])
            merged.update(groups[kthGroup])
            groups.remove(groups[jthGroup])
            groups.remove(groups[kthGroup])
            groups.append(merged)

Where my input, results, is a list of tuple{2} and groups is a list of set. Note that my code isn't necessarily efficient here; it's provided only for illustrative purposes.

My data structure, groups, has the following properties:

  • For each (jthName,kthName):

    • If neither element of (jthName,kthName) is found in any contained set, create set((jthName,kthName)) within our list of sets.
    • If exactly one of (jthName,kthName) is found in one contained set, coalesce the unfound element into that set.
    • If each element of (jthName,kthName) is found in a different set, coalesce the two referenced sets into a single set.
  • Loop invariant: jthName and kthName cannot be contained in more than one set.


My justification for this data structure is to create a flat decomposition of an unknown set of connected nodegraphs, where each unique element name is a node and each unique pair is an edge. My rationale is that my graphs are incomplete, and I require this view to select only the known members of each graph to feed into an algorithm that will regressively determine graph connectivity and directionality of the edges (that is, the complete set of DAGs expressed by the data). But, I digress.

Is there a friendly name for the data structure represented by variable groups? If so or if not, is there a more time- or space-efficient means of performing this decomposition?

MrGomez
  • 23,788
  • 45
  • 72
  • May be more appropriate on http://cstheory.stackexchange.com/. I didn't post it there because this is, to my knowledge, an undergraduate level question from an undertrained theorist. – MrGomez Mar 06 '12 at 01:35

1 Answers1

7

I think what you're looking for is something called a Disjoint-set data structure.

It's often used when doing Kruskal's because it allows you to do n lookups in amortized nlog*n (actually less than that) time if you implement the disjoint-set data structure with path compression.

It's pretty reasonable to implement and I think the wiki page pseudo code lends itself nicely to python. If you need more assistance, this SO question might help.

If you used the disjoint-set data structure, your code would look like this:

for t in results:
   (jName, kName) = t

   union(jName, kName)
Community
  • 1
  • 1
Ian Bishop
  • 5,185
  • 3
  • 26
  • 37