
I have a pandas dataframe (150,000 rows and 9 columns) like:


user|phone1|phone2|phone3    
----+------+------+------
   A|   123|  1111|66    
   B|   456|  1111|77    
   C|   123|  2222|77
   D|   456|  2222|88
   E|   789|  5555|0
  • User A has the same phone1 value as user C, so A and C form a group.
  • User B has the same phone2 value as user A, so A and B form a group.
  • Thus A, B, and C are all in one group.

The logic is the same for all users. In this example, [A, B, C, D] is one group, because every member shares at least one value with some other member. [E] is another group.

How can I get final result like:

{group1:[A,B,C,D], group2:[E]}

This is my try:

  1. First, group each column by value and put the users sharing a value into a group, like

    list_1 (phone1) = [[A,C],[B,D],[E]]

    list_2 (phone2) = [[A,B],[C,D],[E]]

  2. For each item in list_1, search in list_2. If two items share a user, merge the list_2 item into the list_1 item, e.g. [A,C]+[A,B], and finally pop [A,B] from list_2.
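Step 1 can be done with a pandas `groupby` per column; a minimal sketch on the example data (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'user':   ['A', 'B', 'C', 'D', 'E'],
    'phone1': [123, 456, 123, 456, 789],
    'phone2': [1111, 1111, 2222, 2222, 5555],
})

# One list of user groups per phone column: users sharing a value land together.
list_1 = [sorted(g) for _, g in df.groupby('phone1')['user']]
list_2 = [sorted(g) for _, g in df.groupby('phone2')['user']]

print(list_1)  # [['A', 'C'], ['B', 'D'], ['E']]
print(list_2)  # [['A', 'B'], ['C', 'D'], ['E']]
```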

This is my code:

for m in range(len(list_1)):
    drop_list = []
    for n in range(len(list_2)):
        if set(list_1[m]) & set(list_2[n]):  # if_common: any shared user
            list_1[m] = list(set(list_1[m] + list_2[n]))
            drop_list.append(n)
    # pop in reverse so earlier indices stay valid
    for i in reversed(drop_list):
        list_2.pop(i)

but it's too slow; I have nearly 100,000 groups in each column. Is there a quicker way to do this?

Clark 123
  • I can't see your image ([why not paste pictures of code/data?](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question)) but IIUC, you want to implement a connected component algorithm. This is not trivial. – pault Jul 24 '18 at 14:03
  • Possible duplicate of [Python: simple list merging based on intersections](https://stackoverflow.com/questions/9110837/python-simple-list-merging-based-on-intersections) – pault Jul 26 '18 at 17:16
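The connected-components view from the comments can be sketched in pure Python with a BFS over a value-to-users index (data hard-coded from the question's example table):

```python
from collections import defaultdict, deque

rows = {
    'A': [123, 1111, 66],
    'B': [456, 1111, 77],
    'C': [123, 2222, 77],
    'D': [456, 2222, 88],
    'E': [789, 5555, 0],
}

# Index users by (column, value) so users sharing a value become neighbors.
by_value = defaultdict(list)
for user, phones in rows.items():
    for col, value in enumerate(phones):
        by_value[(col, value)].append(user)

# BFS over users, expanding through every value a user shares.
seen, groups = set(), []
for start in rows:
    if start in seen:
        continue
    queue, component = deque([start]), []
    seen.add(start)
    while queue:
        user = queue.popleft()
        component.append(user)
        for col, value in enumerate(rows[user]):
            for other in by_value[(col, value)]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    groups.append(sorted(component))

print(groups)  # [['A', 'B', 'C', 'D'], ['E']]
```

Each group is one connected component; this is a single pass over the data plus a BFS, so it scales to the 150,000-row case.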

1 Answer


Not sure how to do it with the pandas API, but there is a very efficient generic algorithm for this called disjoint-set, also known as union-find (wikipedia).

To implement it you need two dictionaries: one maps row ids to row ids (the parent links), and another maps values to row ids. Each row I'm going to represent as values like ('phone1', '123'), ('phone2', '1111'), ....

Then we iterate through the data. We look up each column value in the second dictionary; if we have seen it already, we add a link in the disjoint set between the two rows.

All in all should look like:

def recursive_lookup(disjoint_set, row):
  # Path compression: point row directly at its representative.
  if disjoint_set[row] != row:
    disjoint_set[row] = recursive_lookup(disjoint_set, disjoint_set[row])
  return disjoint_set[row]

disjoint_set = {}
value_lookup = {}
for row in range(len(list_1)):
  disjoint_set[row] = row  # Mark it as an independent set.
  for key, value in list_1[row].items():  # not sure how to get key/value with pandas
    if (key, value) not in value_lookup:
      value_lookup[(key, value)] = row
    else:
      other_row = value_lookup[(key, value)]
      actual_other = recursive_lookup(disjoint_set, other_row)
      actual_row = recursive_lookup(disjoint_set, row)
      disjoint_set[actual_row] = actual_other

At the end, use recursive_lookup on each row you are interested in to get a representative of its cluster. In other words, all rows that return the same value from recursive_lookup are in the same cluster.

This should be fairly fast, as you only need to go through the data once. With path compression, the disjoint-set operations take near-O(1) amortized time, so they shouldn't add too much overhead; it should be about as fast as reading/writing the data.
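Putting the pieces above together, a runnable sketch on the question's example data (the row layout here is a plain list of dicts, standing in for the dataframe):

```python
from collections import defaultdict

def recursive_lookup(disjoint_set, row):
    # Path-compressing find: follow parents to the representative.
    if disjoint_set[row] != row:
        disjoint_set[row] = recursive_lookup(disjoint_set, disjoint_set[row])
    return disjoint_set[row]

rows = [
    ('A', {'phone1': 123, 'phone2': 1111, 'phone3': 66}),
    ('B', {'phone1': 456, 'phone2': 1111, 'phone3': 77}),
    ('C', {'phone1': 123, 'phone2': 2222, 'phone3': 77}),
    ('D', {'phone1': 456, 'phone2': 2222, 'phone3': 88}),
    ('E', {'phone1': 789, 'phone2': 5555, 'phone3': 0}),
]

disjoint_set, value_lookup = {}, {}
for i, (_, values) in enumerate(rows):
    disjoint_set[i] = i  # start as its own set
    for key_value in values.items():
        if key_value not in value_lookup:
            value_lookup[key_value] = i
        else:
            # Link the two rows that share this (column, value) pair.
            other = recursive_lookup(disjoint_set, value_lookup[key_value])
            disjoint_set[other] = recursive_lookup(disjoint_set, i)

# Collect users by representative to get the final clusters.
clusters = defaultdict(list)
for i, (user, _) in enumerate(rows):
    clusters[recursive_lookup(disjoint_set, i)].append(user)

print(sorted(clusters.values()))  # [['A', 'B', 'C', 'D'], ['E']]
```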

Sorin