I have the following dataset:

    0   1   2   3
0   a   ❤     
1   b   ❤     
2   c       
3   d     ✨   
4   e   ❤     

I would like to perform clustering to group the ROWS which have something in common.

Using networkx with the following code, I get this result:

import networkx as nx
import matplotlib.pyplot as plt

G=nx.from_pandas_edgelist(df, 0, 1)
nx.draw(G, with_labels=True)
plt.show()

output: groups obtained with networkx

How can I also take columns 2 and 3 into account? Can I do it without giving priority to any particular column (for example, I want column 2 to be as important as column 1)?

yatu
ardito.bryan

1 Answer

Similarly to this answer, you could have each dataframe row be a path, and look for the connected components. I've added a row that shares no values with any other row to better illustrate how this works:

print(df)
   0  1   2    3
0  a  ❤    
1  b  ❤    
2  c      
3  d    ✨  
4  e  ❤    
5  f      

So iterate over the dataframe rows, and add them as paths with nx.add_path:

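import networkx as nx

# Each row becomes a path through its cell values
my_list = df.values.tolist()
G = nx.Graph()
for path in my_list:
    nx.add_path(G, path)
# Rows that share any value end up in the same connected component
components = list(nx.connected_components(G))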

print(components)
[{'a', 'b', 'c', 'd', 'e', '✨', '❤', '', '', '', '', ''},
 {'f', '', '', ''}]
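
One thing to watch out for: if the blank cells in your dataframe are all stored as the same value (an empty string, say), that value becomes a single shared node and links every row that has a blank. Here is a minimal sketch of the same path idea that drops blanks first. The rows and the values 'x', 'y', 'z' are made up for illustration, and treating '' as "blank" is an assumption about your data:

```python
import networkx as nx

# Hypothetical rows: the first item is the row label, the rest are cell values.
# Empty strings stand in for blank cells (an assumption about the data).
rows = [
    ["a", "x", "", ""],
    ["b", "x", "", "z"],
    ["c", "", "y", ""],
    ["d", "", "y", "z"],
    ["e", "", "", ""],
]

G = nx.Graph()
for row in rows:
    # Drop blanks so unrelated rows are not linked through a shared '' node
    path = [value for value in row if value != ""]
    # A path with a single node just adds that node with no edges
    nx.add_path(G, path)

components = list(nx.connected_components(G))
print(components)
```

With this toy data, rows a to d should end up in one component (a and b share 'x', b and d share 'z', c and d share 'y'), while e stays on its own because it has no values at all.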

Now you can iterate over the components and, for each one, collect the rows whose values all belong to it into a sublist of a nested list:

groups = []
for component in components:
    group = []
    for path in my_list:
        if component.issuperset(path):
            group.append(path)
    groups.append(group)

In this case, all rows except the last are grouped together, and the last row is in a group of its own.

print(groups)

[[['a', '❤', '', ''],
  ['b', '❤', '', ''],
  ['c', '', '', ''],
  ['d', '', '✨', ''],
  ['e', '❤', '', '']],
 [['f', '', '', '']]]
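
If you would rather have one group number per row than a nested list, a possible variation is to map every node to the index of its component and look up each row by its first cell. This is only a sketch: the rows below are made-up data, and it assumes the first column holds a unique label for each row. The names node_to_group and labels are just for illustration:

```python
import networkx as nx

# Hypothetical rows with no blank cells; the first item is a unique row label
rows = [
    ["a", "x", "p"],
    ["b", "x", "q"],
    ["c", "y", "r"],
    ["d", "y", "q"],
    ["e", "w", "s"],
]

G = nx.Graph()
for row in rows:
    nx.add_path(G, row)

# Map every node (labels and cell values alike) to the index of its component
node_to_group = {}
for group_id, component in enumerate(nx.connected_components(G)):
    for node in component:
        node_to_group[node] = group_id

# A row belongs to whichever component contains its label
labels = [node_to_group[row[0]] for row in rows]
print(labels)
```

The resulting list lines up with the rows, so with a dataframe you could assign it as a new column (for example df['group'] = labels) and then use groupby on it.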
yatu