Clustering algorithms don't take strings (or lists-of-string-tokens) as input. Rather, you'll always have to preprocess your texts into some sort of numerical data, for use by a clustering algorithm.
But as the documentation for scikit-learn's clustering notes:
One important thing to note is that the algorithms implemented in this module can take different kinds of matrix as input. All the methods accept standard data matrices of shape [n_samples, n_features]. These can be obtained from the classes in the sklearn.feature_extraction module. For AffinityPropagation, SpectralClustering and DBSCAN one can also input similarity matrices of shape [n_samples, n_samples]. These can be obtained from the functions in the sklearn.metrics.pairwise module.
To further clarify what it's saying: if the algorithm takes input of shape [n_samples, n_features], then it's assuming each of your texts has already been reduced to a fixed-size vector (of n_features size). A calculation like "Word Mover's Distance" doesn't inherently reduce a single text to a fixed-size vector – as you've noted, WMD only works on pairs of texts. So you can't directly use WMD to convert single texts, alone, into the n_features-sized vectors that these algorithms require. (But also: see the 'final note' at bottom.)
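(As an aside, here's a minimal sketch of what such a fixed-size [n_samples, n_features] input might look like, using TfidfVectorizer from the sklearn.feature_extraction module the docs mention – nothing to do with WMD, just an illustration of the shape; raw_texts is a hypothetical list of untokenized strings:)

from sklearn.feature_extraction.text import TfidfVectorizer

raw_texts = ["the cat sat", "a dog barked", "cats and dogs"]  # hypothetical example strings
X = TfidfVectorizer().fit_transform(raw_texts)
print(X.shape)  # (n_samples, n_features): one fixed-size row per text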
On the other hand, if the algorithm is able to take input of shape [n_samples, n_samples], then it needs all possible pairwise similarities. Those you can pre-calculate using WMD, in a couple of loops over your texts (or using the scikit-learn utility function pairwise_distances()). And note, these scikit-learn algorithms need affinity='precomputed' specified, and expect similarity values, where more-similar pairs return higher values, rather than distance values, where more-similar pairs return lower values. So you'd need to convert WMD distances to some sort of similarity scale. (One simple way that may suffice is to negate the distance value.)
For example, if texts includes all your (tokenized) texts, and wv your gensim word-vectors:
import numpy as np

dist_matrix = np.zeros((len(texts), len(texts)))
for i in range(len(texts)):
    for j in range(len(texts)):
        if i == j:
            continue  # self-distance stays 0.0
        if i > j:
            dist_matrix[i, j] = dist_matrix[j, i]  # re-use earlier calc
            continue  # skip the redundant recomputation
        dist_matrix[i, j] = wv.wmdistance(texts[i], texts[j])

# this negation simply makes the largest dists into the smallest (most-negative) sims
# you could also consider instead a calc that maps all dists to [-1.0, 1.0]
sim_matrix = -dist_matrix
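As the final comment hints, if you'd rather have similarities bounded to [-1.0, 1.0], one simple (hypothetical) rescaling is:

# identical texts map to 1.0; the most-distant pair maps to -1.0
# (assumes at least one nonzero distance in the matrix)
sim_matrix = 1.0 - 2.0 * (dist_matrix / dist_matrix.max())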
Either way, that sim_matrix should work inside one of the scikit-learn algorithms that accept the affinity='precomputed' option.
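For example, a minimal sketch with AffinityPropagation (one of the algorithms the quoted docs name):

from sklearn.cluster import AffinityPropagation

clusterer = AffinityPropagation(affinity='precomputed')
labels = clusterer.fit_predict(sim_matrix)  # one cluster label per text

(AffinityPropagation's other defaults, like its preference and damping values, may need tuning for your data.)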
But beware: WMD is quite expensive to calculate (especially on longer texts), so depending on the size of your set of sentences, just precomputing all the pairwise distances may be very time-consuming.
A final note: you could also, as with the similarity-matrix above, convert your individual texts into fixed-size vectors using WMD calculations – by making each text's vector be its distances (or similarities) to all the other texts. That is, essentially use the same pairwise process above, but then pass the resulting matrix into one of the algorithms that takes a [n_samples, n_features]-shaped input.
I'm unsure if this will work as well as an algorithm designed for such an input, but as long as you've spent the time calculating all pairwise distances (or similarities), it could be worth a try.
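A hypothetical sketch of that approach, reusing the dist_matrix from above with plain KMeans (the n_clusters=5 is an arbitrary assumption you'd tune):

from sklearn.cluster import KMeans

# each row of dist_matrix is treated as that text's feature vector:
# its WMD distances to every other text
labels = KMeans(n_clusters=5).fit_predict(dist_matrix)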