I have a pandas dataframe (150,000 rows and 9 columns) like:
user|phone1|phone2|phone3
----+------+------+------
A| 123| 1111|66
B| 456| 1111|77
C| 123| 2222|77
D| 456| 2222|88
E| 789| 5555|0
- User A has the same
phone1
number with user C so A and C is a group. - User B has the same
phone2
number with C, so B and C is a group. - Thus A,B,C is a group.
The logic is same for all the users. In this example, [A,B,C,D]
is a group, because they have at least one same value by any two of them. [E]
is another group.
How can I get final result like:
{group1:[A,B,C,D], group2:[E]}
This is my try:
First, group by each columns with same value,any put user as a group, like
list_1 (phone1) = [[A,C],[B,D],[E]]
list_2 (phone2) = [[A,B],[C,D],[E]]
For each item in
list_1
search inlist_2
. If two items have same value, then add items fromlist_2
on items fromlist_1
, eg,[A,C]+[A,B]
and at last,pop[A,B]
inlist_2
This is my code:
for m in range(0,len(list_1)):
drop_list = []
for n in range(0,len(list_2)):
if if_common(list_1[m], list_2[n]) == True:
list_1[m] = list(set(list_1[m]+list_2[n]))
drop_list.append(n)
for i in drop_list:
list_2.pop(i)
but its too slow, I have nearly 100000 groups in each columns. Is there any quick way to realize this?