7

I'm dealing with polygonal data in realtime here, but the problems quite simple. I have a huge list containing thousands of sets of polygon Indecies (Integers) and I need to simplify the list as "fast" as possible into a list of sets of "connected" Indecies. i.e. Any sets containing integers that are also in another set become one set in the result. I've read several possible solutions involving sets & graphs etc. All i'm after are a final list of sets which had any degree of commonality.

I'm dealing with lots of data here, but for simplicities sake here's some sample data:

setA = set([0,1,2])
setB = set([6,7,8,9])
setC = set([4,5,6])
setD = set([3,4,5,0])
setE = set([10,11,12])
setF = set([11,13,14,15])
setG = set([16,17,18,19])

listOfSets = [setA,setB,setC,setD,setE,setF,setG]

In this case I'm after a list with a result like this, although ordering is irrelevant:

connectedFacesListOfSets = [ set([0,1,2,3,4,5,6,7,8,9]), set([10,11,12,13,14,15]), set([16,17,18,19])]

I've looked for similar solutions, but the one with the highest votes gave incorrect results on my large test data.

Merge lists that share common elements

Community
  • 1
  • 1
andySnooker
  • 71
  • 1
  • 1
  • 3
  • lambacck's seems succinct and works for me so I'm going with that answer unless there's a faster solution. In my list of 4064 sets it takes less than 0.01 secs so that'll do me fine. – andySnooker May 27 '11 at 08:25
  • I'm curious as to how you finished your task using lambaack's method in .01 seconds. I've been running it for the past 2 minutes and its still not finished yet. What was your smallest/largest beginning set? I'm using a randomly generated list of 4000 sets. Not that I'm knocking lambaack or anything (my answer probably takes longer), I was just wondering what you used. – Bryce Siedschlaw May 27 '11 at 15:50
  • @Bryce Siedschlaw: What is your code for creating the sets? I'm interested to try on a larger group of sets. – lambacck May 27 '11 at 19:16
  • @lambacck I added code to my answer to generate the random sets list – Bryce Siedschlaw May 27 '11 at 19:33
  • I ran each answer with timeit--fastest was lamback 0.03 (11 function calls), followed by Thiago 0.07 (50 calls), and Bryce 0.1 (69 calls). – twneale May 28 '11 at 04:44
  • @twneale: be sure that if you did multiple iterations with timeit that you use a deep copy of listOfSets (`[set(x) for x in listOfSets]`) otherwise you are just re-running the already combined sets. – lambacck May 28 '11 at 18:06
  • @twneale: And please remove the print statements since they take too much time. Thanks. – Thiago Chaves May 28 '11 at 20:09
  • I have tested lambacck's and my method now. lambacck's is 10x faster by my measurements. I guess all those indirections and control maps take too much lot of time. – Thiago Chaves May 28 '11 at 21:46
  • @lambacck, Thiago: I ran timeit like this "t = timeit.Timer('thiago()', 'from __main__ import thiago'); print t.repeat(number=1000). For each, everything was inside the function. I also removed all the prints. – twneale May 29 '11 at 01:51

5 Answers5

4

It's hard to tell the performance without a sufficiently large set, but here is some basic code to start from:

while True:
    merged_one = False
    supersets = [listOfSets[0]]

    for s in listOfSets[1:]:
        in_super_set = False
        for ss in supersets:
            if s & ss:
               ss |= s
               merged_one = True
               in_super_set = True
               break

        if not in_super_set:
            supersets.append(s)

    print supersets
    if not merged_one:
        break

    listOfSets = supersets       

This works in 3 iterations on the provided data. And the output is as follows:

[set([0, 1, 2, 3, 4, 5]), set([4, 5, 6, 7, 8, 9]), set([10, 11, 12, 13, 14, 15]), set([16, 17, 18, 19])]
[set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), set([10, 11, 12, 13, 14, 15]), set([16, 17, 18, 19])]
[set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), set([10, 11, 12, 13, 14, 15]), set([16, 17, 18, 19])]
lambacck
  • 9,768
  • 3
  • 34
  • 46
2

This is a union find problem.

Though I haven't used it, this Python code looks good to me.

http://code.activestate.com/recipes/577225-union-find/

Rob Neuhaus
  • 9,190
  • 3
  • 28
  • 37
1

Forgive the messed up caps (autocorrect...):

# the results cotainer
Connected = set()

sets = # some list of sets

# convert the sets to frozensets (which are hashable and can be added to sets themselves)
Sets = map(frozenset, sets)

for s1 in sets:
    Res = copy.copy(s1)
    For s2 in sets:
        If s1 & s2:
            Res = res | s2
    Connected.add(res)  
twneale
  • 2,836
  • 4
  • 29
  • 34
  • This algorithm runs only once and would give the wrong answer in many situations. For example, for the input [(1,2),(2,3),(3,4)] it will answer incorrectly [(1,2,3),(1,2,3,4),(2,3,4)] – Thiago Chaves May 27 '11 at 17:20
0

So.. I think I got it. It's a mess but I got it. Here's what I did:

def connected_valid(li):
    for i, l in enumerate(li):
        for j, k in enumerate(li):
            if i != j and contains(l,k):
                return False
    return True

def contains(set1, set2):
    for s in set1:
        if s in set2:
            return True
    return False

def combine(set1, set2):
    set2 |= set1
    return set2

def connect_sets(li):
    while not connected_valid(li):
        s1 = li.pop(0)
        s2 = li[0]
        if contains(s1, s2):
            li[0] = combine(s1,s2)
        else:
            li.append(s1)
    return li

Then in the main function you'd do something like this:

setA = set([0,1,2])
setB = set([6,7,8,9])
setC = set([4,5,6])
setD = set([3,4,5,0])
setE = set([10,11,12])
setF = set([11,13,14,15])
setG = set([16,17,18,19])

connected_sets = connect_sets([setA,setB,setC,setD,setE,setF,setG,])

After running it, I got the following output

print connected_sets
[set([0,1,2,3,4,5,6,7,8,9]), set([10,11,12,13,14,15]), set([16,17,18,19])]

Hope that's what you're looking for.

EDIT: Added code to randomly generate sets:

# Creates a list of 4000 sets with a random number of values ranging from 0 to 20000
sets = []
ma = 0
mi = 21000
for x in range(4000):
    rand_num = sample(range(20),1)[0]
    tmp_set_li = sample(range(20000), rand_num)
    sets.append(set(tmp_set_li))

The last 3 lines can be condensed into one if you really wanted to.

Bryce Siedschlaw
  • 4,136
  • 1
  • 24
  • 36
  • If your going for speed, you will want to remove the function calls, since they are a well known slowdown. Also searching for a non-connected set on each iteration of the `while` loop takes a long time, better to track that as you iterate and keep a flag. – lambacck May 26 '11 at 23:21
0

I tried to do something different: this algorithm loops once for each set and once for each element:

# Our test sets
setA = set([0,1,2])
setB = set([6,7,8,9])
setC = set([4,5,6])
setD = set([3,4,5,0])
setE = set([10,11,12])
setF = set([11,13,14,15])
setG = set([16,17,18,19])

list_of_sets = [setA,setB,setC,setD,setE,setF,setG]

# We will use a map to store our new merged sets.
# This map will work as an reference abstraction, so it will
# map set ids to the set or to other set id.
# This map may have an indirection level greater than 1
merged_sets = {}

# We will also use a map between indexes and set ids.
index_to_id = {}

# Given a set id, returns an equivalent set id that refers directly
# to a set in the merged_sets map
def resolve_id(id):
    if not isinstance(id, (int, long)):
        return None
    while isinstance(merged_sets[id], (int, long)):
        id = merged_sets[id]
    return id


# Points the informed set to the destination id
def link_id(id_source, id_destination):
    point_to = merged_sets[id_source]
    merged_sets[id_source] = id_destination
    if isinstance(point_to, (int, long)):
        link_id(point_to, id_destination)


empty_set_found = False
# For each set
for current_set_id, current_set in enumerate(list_of_sets):
    if len(current_set) == 0 and empty_set_found:
        continue
    if len(current_set) == 0:
        empty_set_found = True
    # Create a set id for the set and place it on the merged sets map
    merged_sets[current_set_id] = current_set
    # For each index in the current set
    possibly_merged_current_set = current_set
    for index in current_set:
        # See if the index is free, i.e., has not been assigned to any set id
        if index not in index_to_id:
            # If it is free, then assign the set id to the index
            index_to_id[index] = current_set_id
            # ... and then go to the next index
        else:
            # If it is not free, then we may need to merge the sets
            # Find out to which set we need to merge the current one,
            # ... dereferencing if necessary
            id_to_merge = resolve_id(index_to_id[index])
            # First we check to see if the assignment is to the current set or not
            if id_to_merge == resolve_id(merged_sets[current_set_id]):
                continue
            # Merge the current set to the one found
            print 'Merging %d with %d' % (current_set_id, id_to_merge)
            merged_sets[id_to_merge] |= possibly_merged_current_set
            possibly_merged_current_set = merged_sets[id_to_merge]
            # Map the current set id to the set id of the merged set
            link_id(current_set_id, id_to_merge)
# Return all the sets in the merged sets map (ignore the references)
print [x for x in merged_sets.itervalues() if not isinstance(x, (int, long))]

It prints:

Merging 2 with 1
Merging 3 with 0
Merging 3 with 1
Merging 5 with 4
[set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), set([10, 11, 12, 13, 14, 15]), set([16, 17, 18, 19])]
Thiago Chaves
  • 9,218
  • 5
  • 28
  • 25