-1

I built a disjoint-set data structure for use with Kruskal's MST algorithm. I need to load and then union a graph with 200k interconnected nodes and I think my data structure implementation is a bottleneck.

Do you have an suggestions for how to improve performance? I think my find method might be problematic.

class partition(object):
    def __init__(self, element=None):
        self.size = 0
        if element == None:
            self.contents = set()
            self.representative = None
        else:
            self.contents = {element}
            self.representative = element
            self.size = 1

    def find(self, element):
        return element in self.contents

    def add(self, partition):
        self.contents = self.contents.union(partition)
        self.size = len(self.contents)

    def show(self):
        return self.contents

    def __repr__(self):
        return str(self.contents)

class disjoint_set(object):
    def __init__(self):
        self.partitions_count = 0
        self.forest = {}

    def make_set(self, element):
        if self.find(element) == False:
            new_partition = partition(element)
            self.forest[new_partition.representative] = new_partition
            self.partitions_count += 1

    def union(self, x, y):
        if x != y:
            if self.forest[x].size < self.forest[y].size:
                self.forest[y].add(self.forest[x].show())
                self.delete(x)
            else:
                self.forest[x].add(self.forest[y].show())
                self.delete(y)

    def find(self, element):
        for partition in self.forest.keys():
            if self.forest[partition].find(element):
                return self.forest[partition].representative
        return False

    def delete(self, partition):
        del self.forest[partition]
        self.partitions_count -= 1

if __name__ == '__main__':
    t = disjoint_set()
    t.make_set(1)
    t.make_set(2)
    t.make_set(3)
    print("Create 3 singleton partitions:")
    print(t.partitions_count)
    print(t.forest)
    print("Union two into a single partition:")
    t.union(1,2)
    print(t.forest)
    print(t.partitions_count)

EDIT:

After reading the comments and doing additional research I realized how poorly designed my original algorithm was. I started over from scratch and put this together. I put all the partitions into a single hash table and used path compression in the find(). How does this look and are there any glaring problems I should address?

class disjoint_set(object):
def __init__(self):
    self.partitions_count = 0
    self.size = {}
    self.parent = {}

def make_set(self, element):
    if self.find(element) == False:
        self.parent[element] = element
        self.size[element] = 1
        self.partitions_count += 1

def union(self, x, y):
    xParent = self.find(x)
    yParent = self.find(y)
    if xParent != yParent:
        if self.size[xParent] < self.size[yParent]:
            self.parent[xParent] = yParent
            self.size[yParent] += self.size[xParent]
            self.partitions_count -= 1
        else:
            self.parent[yParent] = xParent
            self.size[xParent] += self.size[yParent]
            self.partitions_count -= 1

def find(self, element):
    if element in self.parent:
        if element == self.parent[element]:
            return element
        root = self.parent[element]
        while self.parent[root] != root:
            root = self.find(self.parent[root])
        self.parent[element] = root
        return root
    return False

if __name__ == '__main__':
    t = disjoint_set()
    t.make_set(1)
    t.make_set(2)
    t.make_set(3)
    t.make_set(4)
    t.make_set(5)
    print("Create 5 singleton partitions")
    print(t.partitions_count)
    print("Union two singletons into a single partition")
    t.union(1,2)
    print("Union three singletones into a single partition")
    t.union(3,4)
    t.union(5,4)
    print("Union a single partition")
    t.union(2,4)
    print("Parent List: %s" % t.parent)
    print("Partition Count: %s" % t.partitions_count)
    print("Parent of element 2: %s" % t.find(2))
    print("Parent List: %s" % t.parent)
Solomon Bothwell
  • 1,004
  • 2
  • 12
  • 21
  • can you add an if __name__ == '__main__' section to show usage? –  Apr 26 '17 at 17:03
  • yes sorry about that! Adding now. – Solomon Bothwell Apr 26 '17 at 17:07
  • [Use a real disjoint-set forest data structure.](https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Disjoint-set_forests) You've just picked a few names that sound kind of like a disjoint set forest and then written a highly naive algorithm that doesn't have anything to do with a real disjoint set forest. – user2357112 Apr 26 '17 at 17:11
  • Try Up tree data structure? – Bobby Apr 26 '17 at 17:18

1 Answers1

0

I guess your find implementation is not running effienctly, which it supposed to be.

Following changes may help.

class disjoint_set(object):
    def __init__(self):
        self.partitions_count = 0
        self.forest = {}
        self.parent = {}

    def make_set(self, element):
        if not self.find(element):
            new_partition = partition(element)
            self.parent[element] = element
            self.forest[new_partition.representative] = new_partition
            self.partitions_count += 1

def union(self, x, y):
    if x != y:
        if self.forest[x].size < self.forest[y].size:
            self.forest[y].add(self.forest[x].show())
            #Update parent details 
            self.parent[self.forest[x].representative] = self.forest[y].representative
            self.delete(x)
        else:
            self.forest[x].add(self.forest[y].show())
            #Update parent details 
            self.parent[self.forest[y].representative] = self.forest[x].representative
            self.delete(y)

def find(self, element):
    if self.parent[element] == element:
        return element
    else:
        return find(element)

The code can be still optimized with path-compression to have disjoint_set.find to run in O(1). I guess O(log n) still is good for big numbers.

Another bottleneck could be your union function. Especially the add function implementation.

def add(self, partition):
    self.contents = self.contents.union(partition)

Try using an update method of set (which is an inplace union). I think is causing lot of memory overhead for huge no.of nodes. Something like

self.contents.update(partition)

There is a nice discussion on set union and update functions here.

Hope it helps!

Community
  • 1
  • 1
arunk2
  • 2,246
  • 3
  • 23
  • 35
  • 1
    That's not path compression at all. This algorithm has essentially nothing to do with disjoint set forests or path compression. – user2357112 Apr 27 '17 at 00:09
  • By path compression, I mean keeping the execution of find in O(1). Agree to what you said - it is not a path compression implementation. The intention of path compression is to keep the operation in O(1) right? thats some thing I want to convey. Removed the mention of path-compression. Thanks for pointing it. – arunk2 Apr 27 '17 at 00:13
  • It's not O(1) either, and path compression doesn't make find run in O(1). The `partition.find` method here is O(1), but that's not a standard disjoint set operation; the actual disjoint set find operation is implemented by `disjoint_set.find`, where we have to actually find which set the element belongs to. That's O(len(self.forest)), which is terrible. – user2357112 Apr 27 '17 at 00:18
  • Agree. There is a flaw in the implementation of disjoint_set.find. I have made a few changes. @user2357112 check if this can be improved or any other flaw. – arunk2 Apr 27 '17 at 00:56
  • Thanks @ArunKumar for the comments. I'll try these changes out. – Solomon Bothwell Apr 27 '17 at 05:45
  • @user2357112 I understand that my implementation is not efficient. I am trying to learn about this data structure and I'm sorry if that upsets you. – Solomon Bothwell Apr 27 '17 at 05:47
  • Your `find` function is ill-defined (`return self.find(parent[element])` perhaps?), and your indentation needs fixing. – chepner Apr 28 '17 at 03:11
  • @chepner thanks for the edit. Between I was not giving a working code. But suggested few changes in implementation for performance as requested. – arunk2 Apr 28 '17 at 05:16