I think it is more natural to model the problem as a graph.
For example, you can treat apple as node 0 and banana as node 1, so the first list means there is an edge between nodes 0 and 1.
First, fit a label encoder that maps the labels to numbers:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit(['apple','banana','orange','rice','potatoes'])
Now define the pairs:
l=[['apple','banana'],
['apple','orange'],
['banana','orange'],
['rice','potatoes'], #I dropped orange here because an edge joins exactly 2 nodes; you could instead expand the triple into 3 pairs, or pick a different approach
['potatoes','rice']]
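If a group has more than two items (such as the triple mentioned in the comment above), one option is to expand it into all of its pairs before building the edges. Here is a minimal sketch using itertools.combinations. The `to_pairs` helper and the `groups` list are made up for illustration, and the triple is my guess at what the original row looked like:

```python
from itertools import combinations

def to_pairs(groups):
    """Expand each group of labels into all of its 2-item combinations."""
    pairs = []
    for group in groups:
        # a group of 2 yields 1 pair, a group of 3 yields 3 pairs, and so on
        pairs.extend(list(pair) for pair in combinations(group, 2))
    return pairs

# hypothetical input: the fourth group is a triple (assumed, not from the original data)
groups = [['apple', 'banana'],
          ['apple', 'orange'],
          ['banana', 'orange'],
          ['rice', 'potatoes', 'orange'],
          ['potatoes', 'rice']]

pairs = to_pairs(groups)
print(pairs)
```

Keep in mind that leaving orange in that triple links it to rice and potatoes, so with this input every label would end up in one cluster.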
Convert the labels in each pair to numbers:
edges=[le.transform(x) for x in l]
>>edges
[array([0, 1], dtype=int64),
array([0, 2], dtype=int64),
array([1, 2], dtype=int64),
array([4, 3], dtype=int64),
array([3, 4], dtype=int64)]
Now build the graph and add the edges:
import networkx as nx #graphs package
G=nx.Graph() #create the graph and add edges
for e in edges:
    G.add_edge(e[0], e[1])
Now you can use the connected_components function to find the connected vertices (the older connected_component_subgraphs function was removed in NetworkX 2.4):
components = nx.connected_components(G) #one set of nodes per connected component
comp_dict = {idx: sorted(int(n) for n in comp) for idx, comp in enumerate(components)}
print(comp_dict)
output:
{0: [0, 1, 2], 1: [3, 4]}
or, with the numbers mapped back to labels:
print([le.inverse_transform(v) for v in comp_dict.values()])
output:
[array(['apple', 'banana', 'orange']), array(['potatoes', 'rice'])]
Those are your 2 clusters.
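For reference, the clustering step is just a connected-components search. The sketch below does the same grouping in plain Python with a breadth-first search, working directly on the string labels so no LabelEncoder is needed. The function name `connected_clusters` is my own, and this illustrates the idea rather than showing how networkx implements it:

```python
from collections import defaultdict, deque

def connected_clusters(pairs):
    """Group labels so that labels joined by a chain of pairs share a cluster."""
    # adjacency map: label -> set of neighbouring labels
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbours):  # sorted for a deterministic order
        if start in seen:
            continue
        # breadth-first search collects everything reachable from `start`
        seen.add(start)
        queue = deque([start])
        cluster = []
        while queue:
            node = queue.popleft()
            cluster.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(sorted(cluster))
    return clusters

pairs = [['apple', 'banana'],
         ['apple', 'orange'],
         ['banana', 'orange'],
         ['rice', 'potatoes'],
         ['potatoes', 'rice']]

print(connected_clusters(pairs))
# [['apple', 'banana', 'orange'], ['potatoes', 'rice']]
```

For anything beyond a small example I would still use networkx, but this shows there is nothing special going on in that step.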