3

In the list of tuples called mixed_sets, three separate sets exist. Each set contains tuples with values that intersect. A tuple from one set will not intersect with a tuple from another set.

I've come up with the following code to sort out the sets. I found that the python set functionality was limited when tuples are involved. It would be nice if the set intersection operation could look into each tuple index and not stop at the enclosing tuple object.

Here's the code:

mixed_sets=  [(1,15),(2,22),(2,23),(3,13),(3,15),
              (3,17),(4,22),(4,23),(5,15),(5,17),
              (6,21),(6,22),(6,23),(7,15),(8,12),
              (8,15),(9,19),(9,20),(10,19),(10,20),
              (11,14),(11,16),(11,18),(11,19)]

def sort_sets(a_set):
    idx= 0
    idx2=0
    while len(mixed_sets) > idx and len(a_set) > idx2:
        if a_set[idx2][0] == mixed_sets[idx][0] or a_set[idx2][1] == mixed_sets[idx][1]:
            a_set.append(mixed_sets[idx])
            mixed_sets.pop(idx)
            idx=0

        else:
            idx+=1
            if idx == len(mixed_sets):
                idx2+=1
                idx=0
    a_set.pop(0) #remove first item; duplicate
    print a_set, 'a returned set'            
    return a_set

sorted_sets=[]
for new_set in mixed_sets:
    sorted_sets.append(sort_sets([new_set]))

print mixed_sets #Now empty.

OUTPUT:
[(1, 15), (3, 15), (5, 15), (7, 15), (8, 15), (3, 13), (3, 17), (5, 17), (8, 12)] a returned set
[(2, 22), (2, 23), (4, 23), (6, 23), (4, 22), (6, 22), (6, 21)] a returned set
[(9, 19), (10, 19), (10, 20), (11, 19), (9, 20), (11, 14), (11, 16), (11, 18)] a returned set

Now this doesn't look like the most pythonic way of doing this task. This code is intended for large lists of tuples (approx 2E6) and I felt the program would run quicker if it didn't have to check tuples already sorted. Therefore I used pop() to shrink the mixed_sets list. I found using pop() made list comprehensions, for loops or any iterators problematic, so I've used the while loop instead.

It does work, but is there a more pythonic way of carrying out this task that doesn't use while loops and the idx and idx2 counters?.

JBernardo
  • 32,262
  • 10
  • 90
  • 115
Naaaysmith
  • 81
  • 5
  • 1
    See [this question](http://stackoverflow.com/questions/9110837/python-simple-list-merging-based-on-intersections) for multiple solutions to a variant of this problem. – DSM Aug 09 '12 at 21:50

1 Answers1

0

Probably you can increase the speed by first computing a set of all the first elements in the tuples in the mixed_sets, and a set of all the second elements. Then in your iteration you can check if the first or the second element is in one of these sets, and find the correct complete tuple using binary search. Actually you'd need multi-sets, which you can simulate using dictionaries.

Something like[currently not tested]:

from collections import defaultdict
# define the mixed_sets list.
mixed_sets.sort()
first_els = defaultdict(int)
secon_els = defaultdict(int)

for first,second in mixed_sets:
    first_els[first] += 1
    second_els[second] += 1


def sort_sets(a_set):
    index= 0
    while mixed_sets and len(a_set) > index:
        first, second = a_set[index]
        if first in first_els or second in second_els:
            if first in first_els:
                element = find_tuple(mixed_sets, first, index=0)
                first_els[first] -= 1
                if first_els[first] <= 0:
                    del first_els[first]
            else:
                element = find_tuple(mixed_sets, second, index=1)
                second_els[second] -= 1
                if second_els[second] <= 0:
                    del second_els[second]

            a_set.append(element)
            mixed_sets.remove(element)
        index += 1
    a_set.pop(0) #remove first item; duplicate
    print a_set, 'a returned set'            
    return a_set

Where "find_tuple(mixed_sets, first, index=0,1)" return the tuple belonging to mixed_sets that has "first" at the given index.

Probably you'll have to duplicate also mixed_sets and order one of the copies by the first element and the other one by the second element.

Or maybe you could play with dictionaries again. Adding to the values in "first_els" and "second_els" also a sorted list of tuples.

I don't know how the performances will scale, but I think that if the data is in the order of 2 millions you shouldn't have too much to worry about.

Bakuriu
  • 98,325
  • 22
  • 197
  • 231