
Let's say I have a list of lists of words, for example

[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes','orange'],
 ['potatoes','rice']]

The real set is much bigger. I want to cluster the words so that words that usually occur together end up in the same cluster. In this case the clusters would be ['apple', 'banana', 'orange'] and ['rice', 'potatoes'].
What is the best approach to achieve this kind of clustering?

Patrik Valkovič
    What have you tried so far? Generally, "How do I" type questions are considered off-topic. This forum is more intended for helping when you run into a specific issue. – G. Anderson Oct 11 '18 at 16:30
  • Maybe [sklearn clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) is a place to start – G. Anderson Oct 11 '18 at 16:31
  • I was looking at sklearn clustering, but I have no feature values on which I can perform the clustering. I also tried a big table where I note how often each word occurs with every other word (for example t[apple,banana]=1, t[rice,potatoes]=2), but building it for all words (in effect the Cartesian product) is slow and I am unable to process it in a reasonable amount of time (a sketch of that counting, restricted to pairs that actually occur, is shown after these comments). – Patrik Valkovič Oct 11 '18 at 16:44
  • Specifically for word vectors, you can try [NLTK](https://www.nltk.org/api/nltk.cluster.html) or [Gensim](https://radimrehurek.com/gensim/tutorial.html) for creating word/document vectors, or in some [other](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) way translate your text vectors to be machine readable – G. Anderson Oct 11 '18 at 16:53
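
A minimal sketch of the co-occurrence counting described in the comments above, counting only the pairs that actually occur (via a Counter over itertools.combinations) instead of materializing the full Cartesian product; the records list is just the example data from the question:

from collections import Counter
from itertools import combinations

records = [['apple', 'banana'],
           ['apple', 'orange'],
           ['banana', 'orange'],
           ['rice', 'potatoes', 'orange'],
           ['potatoes', 'rice']]

# count only pairs that actually occur together in some record
cooccurrence = Counter()
for record in records:
    for a, b in combinations(sorted(set(record)), 2):
        cooccurrence[(a, b)] += 1

print(cooccurrence.most_common(3))
# e.g. [(('potatoes', 'rice'), 2), (('apple', 'banana'), 1), (('apple', 'orange'), 1)]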

3 Answers


I think it is more natural to think of the problem as a graph.

You can assume, for example, that apple is node 0 and banana is node 1, so the first list indicates there is an edge between 0 and 1.

so first convert the labels to numbers:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['apple','banana','orange','rice','potatoes'])  # each word gets an integer id (alphabetical order)

now:

l=[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes'], # I dropped 'orange' here because an edge joins exactly two nodes; you can expand the triple into three pairs instead (see the sketch after this list) or find another solution
 ['potatoes','rice']]
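
A minimal sketch of that expansion (my addition, not part of the original answer): itertools.combinations turns a record of n words into all n*(n-1)/2 pairwise edges, so the triple does not have to be trimmed by hand, and the resulting list can be used as l in the steps below:

from itertools import combinations

records = [['apple', 'banana'],
           ['apple', 'orange'],
           ['banana', 'orange'],
           ['rice', 'potatoes', 'orange'],
           ['potatoes', 'rice']]

# each record of n words becomes n*(n-1)/2 pairwise edges
l = [list(pair) for record in records for pair in combinations(record, 2)]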

convert the labels to numbers:

edges=[le.transform(x) for x in l]

>>> edges

[array([0, 1], dtype=int64),
array([0, 2], dtype=int64),
array([1, 2], dtype=int64),
array([4, 3], dtype=int64),
array([3, 4], dtype=int64)]

now, start to build the graph and add the edges:

import networkx as nx #graphs package
G=nx.Graph() #create the graph and add edges
for e in edges:
    G.add_edge(e[0],e[1])

now you can extract the connected components to find the groups of linked words (the connected_component_subgraphs helper was removed in networkx 2.4, so connected_components, which yields the node sets directly, is used here):

components = nx.connected_components(G)  # each connected component is a set of node ids
comp_dict = {idx: list(comp) for idx, comp in enumerate(components)}
print(comp_dict)

output:

{0: [0, 1, 2], 1: [3, 4]}

or

print([le.inverse_transform(v) for v in comp_dict.values()])

output:

[array(['apple', 'banana', 'orange']), array(['potatoes', 'rice'])]

and those are your 2 clusters.

Binyamin Even
  • I was thinking about transforming the problem into a graph, but I have thousands of records and each record consists of at least 10 words, so it can easily happen that the whole graph ends up in one component. I need clustering in the sense that only words occurring frequently together are grouped. I was also thinking about searching for complete subgraphs, but I also need to take into account how often the combinations of words occur. – Patrik Valkovič Oct 11 '18 at 18:15

It will be more meaningful to look for frequent itemsets instead.

If you cluster such short sets of words, everything will be connected at usually just a few levels: nothing in common, one element in common, two elements in common. That is too coarse to be usable for clustering: you'll get everything or nothing connected, and the results may be highly sensitive to data changes and ordering.

So abandon the paradigm of partitioning the data and look for frequent combinations instead.
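
A minimal sketch of that idea using the Apriori implementation in mlxtend; the library choice and the min_support threshold are my assumptions, not part of the answer:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [['apple', 'banana'],
                ['apple', 'orange'],
                ['banana', 'orange'],
                ['rice', 'potatoes', 'orange'],
                ['potatoes', 'rice']]

# one-hot encode the transactions: one row per record, one column per word
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# keep itemsets appearing in at least 20% of the records (illustrative threshold)
frequent = apriori(df, min_support=0.2, use_colnames=True)
print(frequent.sort_values('support', ascending=False))

Frequent pairs such as {rice, potatoes} then stand out by their support, without forcing every word into a cluster.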

Has QUIT--Anony-Mousse

So, after a lot of Googling around, I figured out that I, in fact, can't use standard clustering techniques, because I lack feature variables on which to cluster the words. If I build a table noting how often each word occurs with every other word (in effect the Cartesian product), it is really just an adjacency matrix, and clustering doesn't work well on it.

So, the solution I was looking for is graph community detection. I used the igraph library (via the python-igraph wrapper) to find the clusters, and it runs very well and fast.
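
A minimal sketch of what that can look like with python-igraph; the weighting by co-occurrence counts and the choice of the multilevel (Louvain) algorithm are my assumptions, the answer only states that igraph's community detection was used:

from collections import Counter
from itertools import combinations
import igraph as ig

records = [['apple', 'banana'],
           ['apple', 'orange'],
           ['banana', 'orange'],
           ['rice', 'potatoes', 'orange'],
           ['potatoes', 'rice']]

# weight each edge by how often the two words appear in the same record
weights = Counter()
for record in records:
    for a, b in combinations(sorted(set(record)), 2):
        weights[(a, b)] += 1

# build a weighted, undirected co-occurrence graph from (word, word, count) tuples
g = ig.Graph.TupleList(((a, b, w) for (a, b), w in weights.items()), weights=True)

# multilevel (Louvain) community detection using the co-occurrence counts as weights
communities = g.community_multilevel(weights=g.es['weight'])
print([[g.vs[i]['name'] for i in community] for community in communities])
# e.g. [['apple', 'banana', 'orange'], ['rice', 'potatoes']]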

More information:

Patrik Valkovič
  • You could trivially encode the data: apple is column 0, banana is column 1, ... and then each set is a binary vector (see the sketch after these comments). That works, but the quality will be bad for other reasons (the data is too coarse). – Has QUIT--Anony-Mousse Oct 25 '18 at 10:04
  • @Anony-Mousse But I need to cluster the words, not the sets themselves. If I apply this approach to the words, I get the adjacency matrix described above. In fact, community detection gives quite good results. I don't understand why I got downvotes for this answer (and without any comment). – Patrik Valkovič Oct 25 '18 at 10:59
  • If you transpose the matrix, you could cluster words based on the sets they occur in, too. But that may introduce a bias, difficult to handle, that depends on your corpus. And your original question title suggests you wanted to cluster the sets... – Has QUIT--Anony-Mousse Oct 26 '18 at 08:41
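
For completeness, a minimal sketch of the binary encoding described in the first comment above, using scikit-learn's MultiLabelBinarizer (my choice of helper; the comment only describes the idea); transposing the resulting matrix gives one row per word, as the last comment suggests:

from sklearn.preprocessing import MultiLabelBinarizer

records = [['apple', 'banana'],
           ['apple', 'orange'],
           ['banana', 'orange'],
           ['rice', 'potatoes', 'orange'],
           ['potatoes', 'rice']]

# each record becomes a binary vector with one column per word
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(records)

print(mlb.classes_)  # column order: ['apple' 'banana' 'orange' 'potatoes' 'rice']
print(X.T)           # transposed: one binary row per word, columns are the records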