0

I have a list of tuples that identifies pair wise relations between items.

[(1,2), (2,3), (3,1), (4,5), (5,4), (6,7)]

I want to look across the tuples and collapse them into unique collections as below (maybe as a hash map - any other efficient data structures?):

{a: (1,2,3), b: (4,5), c(6,7)}

Is there an algorithm that does this efficiently - I can only think of a brute force approach right now.

Looking to implement this in Python or R. My original examples has about 28 million tuples.

sriramn
  • 2,338
  • 4
  • 35
  • 45

1 Answers1

3

You basically want to find connected components. For that there is a function connected_components in scipy. You just have to reinterpret your data a bit:

l = [(1,2), (2,3), (3,1), (4,5), (5,4), (6,7)]

from scipy.sparse.csgraph import connected_components
from scipy.sparse import csr_matrix

# make list of unique elements
uniques = list(set(list(map(lambda a: a[0], l)) + list(map(lambda a: a[1], l)))) 
# reverse index to lookup elements index
unique2index = dict([(el, i) for (i, el) in enumerate(uniques)]) 

# prepare data for csr_matrix construction
data = [1 for x in l] # value 1 -- means edge
data_i = [unique2index.get(x[0]) for x in l] # source node
data_j = [unique2index.get(x[1]) for x in l] # target node

graphMatrix = csr_matrix((data, (data_i, data_j)),shape=(len(uniques), len(uniques)))
(numComponents, labels) = connected_components(graphMatrix) # here is the work done
# interpret labels back to original elements
components = [[uniques[j] for (j,x) in enumerate(labels) if x==i] for i in range(0, numComponents)] 

print(components) # [[1, 2, 3], [4, 5], [6, 7]] is printed
  • I want to note that the `components = ...` step is quadratic in the number of unique elements on input. Probably faster would be groupby `import itertools; list(map(lambda group: list([uniques[el[0]] for el in group[1]]),itertools.groupby(enumerate(labels), lambda x: x[1])))` but I didnt like to make another import, and the readability.... (I personally do not like python because one cannot write readable and briefly at once) – Martin Milichovsky Aug 24 '16 at 22:38