I have a DataFrame with one column (plus the index) containing either lists of elements or lists of sublists. I would like to detect common elements across the lists/sublists and merge all lists that share at least one element, so that the resulting lists have no elements in common. The lists/sublists currently look like this (example with 4 rows):

                 Num_ID
Row1   [['A1','A2','A3'],['A1','B1','B2','C3','D1']]
Row2   ['A1','E2','E3']
Row3   [['B4','B5','G4'],['B6','B4']]
Row4   ['B4','C9']

Expected output: n lists with no common elements (example showing the first 2):

['A1','A2','A3','B1','B2','C3','D1','E2','E3']
['B4','B5','B6','C9','G4']
yatu
Jon1
  • Try this: https://stackoverflow.com/questions/15503479/grouping-lists-by-common-elements (it's almost the same problem). – Arpit Jun 20 '19 at 11:04

1 Answer

You can use NetworkX's connected_components function for this. Here's how I'd approach it, adapting this solution:

import networkx as nx
import pandas as pd
from itertools import combinations, chain

# Each row holds either a flat list or a list of sublists
df = pd.DataFrame({'Num_ID': [[['A1','A2','A3'],['A1','B1','B2','C3','D1']],
                              ['A1','E2','E3'],
                              [['B4','B5','G4'],['B6','B4']],
                              ['B4','C9']]})

Start by flattening the sublists in each list:

L = [[*chain.from_iterable(i)] if isinstance(i[0], list) else i 
       for i in df.Num_ID.values.tolist()]

[['A1', 'A2', 'A3', 'A1', 'B1', 'B2', 'C3', 'D1'],
 ['A1', 'E2', 'E3'],
 ['B4', 'B5', 'G4', 'B6', 'B4'],
 ['B4', 'C9']]
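
The one-liner above inspects `i[0]`, so it assumes every row is non-empty and nested at most one level deep. If that does not hold for your data, a small recursive generator may be more forgiving. This is only a sketch: `flatten` is a hypothetical helper name, the empty row is made up for illustration, and it assumes every leaf element is a string.

```python
def flatten(value):
    """Yield the strings found in a string, a list, or arbitrarily nested lists."""
    if isinstance(value, str):
        # A string is a leaf: yield it whole instead of iterating its characters.
        yield value
    else:
        for item in value:
            yield from flatten(item)

rows = [[['A1', 'A2', 'A3'], ['A1', 'B1', 'B2', 'C3', 'D1']],
        ['A1', 'E2', 'E3'],
        [['B4', 'B5', 'G4'], ['B6', 'B4']],
        ['B4', 'C9'],
        []]  # hypothetical empty row, not part of the original data

flat = [list(flatten(row)) for row in rows]
```

For the four original rows, `flat` matches the list `L` shown above; the empty row simply becomes an empty list.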

Since the lists can have more than 2 elements, take all the length-2 combinations from each flattened list and use these as the graph edges (an edge can only connect two nodes):

L2_nested = [list(combinations(l,2)) for l in L]
L2 = list(chain.from_iterable(L2_nested))
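
Listing every pair grows quadratically with the length of each list, and a list with a single element produces no pair at all, so that element would never reach the graph. For connectivity alone it should be enough to link each element to its neighbour and to register the elements as nodes explicitly. A minimal sketch under those assumptions: `build_graph` is a hypothetical helper name, and the `['Z1']` row is invented to show the single-element case.

```python
import networkx as nx

def build_graph(lists):
    """Build a graph whose connected components are the merged groups."""
    G = nx.Graph()
    for items in lists:
        # Register every element, so single-element lists are not lost.
        G.add_nodes_from(items)
        # Chain neighbours (a-b, b-c, ...): enough to connect the whole list.
        G.add_edges_from(zip(items, items[1:]))
    return G

L = [['A1', 'A2', 'A3', 'A1', 'B1', 'B2', 'C3', 'D1'],
     ['A1', 'E2', 'E3'],
     ['B4', 'B5', 'G4', 'B6', 'B4'],
     ['B4', 'C9'],
     ['Z1']]  # hypothetical extra row with a single element

G = build_graph(L)
# Sort for a reproducible order; connected_components yields unordered sets.
groups = sorted(sorted(c) for c in nx.connected_components(G))
```

The components are the same as with the full set of combinations, because two nodes joined by a path of edges end up in the same component whether or not they are directly connected.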

Create a graph and add these pairs as its edges using add_edges_from. Then use connected_components, which yields one set of nodes per connected component of the graph (it returns a generator, hence the list call):

G=nx.Graph()
G.add_edges_from(L2)
list(nx.connected_components(G))

[{'A1', 'A2', 'A3', 'B1', 'B2', 'C3', 'D1', 'E2', 'E3'},
 {'B4', 'B5', 'B6', 'C9', 'G4'}]
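
If the groups are needed back in a DataFrame with one column per group (Set1, Set2, ...), one possible layout is sketched below. The components are hard-coded so the snippet runs on its own, and the names `components`, `columns` and `sets_df` are chosen here for illustration. Groups of different sizes are padded with NaN.

```python
import pandas as pd

# Components as returned by list(nx.connected_components(G)),
# hard-coded here so the snippet is self-contained.
components = [{'A1', 'A2', 'A3', 'B1', 'B2', 'C3', 'D1', 'E2', 'E3'},
              {'B4', 'B5', 'B6', 'C9', 'G4'}]

# Sets are unordered, so sort each one for reproducible column contents.
columns = {f'Set{n}': pd.Series(sorted(c))
           for n, c in enumerate(components, start=1)}

# Series of different lengths are aligned on the index;
# the shorter columns are filled with NaN.
sets_df = pd.DataFrame(columns)
```

Here `sets_df` has 9 rows and 2 columns, with the last 4 rows of Set2 empty. A long format (one row per element plus a group label) may be easier to work with if the groups differ a lot in size.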
yatu
  • Thanks a lot. As a newbie I've to digest all the concepts here but a quick test with my data seems very good. Could you please tell me what should be done to put the resulting sets in a Dataframe with columns Set1, Set2, Set3... ? Thanks again, really impressed by your quick answer – Jon1 Jun 20 '19 at 12:45
  • Everything is fine. Grateful for eternity :-) – Jon1 Jun 20 '19 at 19:36