
I have a list of sentences. I want to cluster my sentences by similarity using Word Mover's Distance (WMD). I am using a word2vec model from gensim to create embeddings for my words.

The clustering algorithms I know (nltk, sklearn) use number vectors as input, so I need to give the sentences as an array (or list) of the embeddings of the words in them. I think I can use the nltk clustering methods with a custom distance function, and I want to use WMD as this custom function. But the WMD function of gensim takes two lists of strings as input.
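For reference, a minimal sketch of the gensim usage I mean (sentences is just a placeholder name for my tokenized corpus):

from gensim.models import Word2Vec

# 'sentences' stands in for a tokenized corpus: a list of lists of word strings
# (the parameter is 'size' instead of 'vector_size' in older gensim 3.x releases)
model = Word2Vec(sentences, vector_size=100, min_count=1)

# wmdistance compares two tokenized texts (lists of string tokens), not embedding arrays
d = model.wv.wmdistance(['the', 'cat', 'sat'], ['a', 'dog', 'ran'])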

Is there a prebuilt WMD function that takes the embeddings, and not the strings, as input? Or is there a clustering algorithm (kmeans or something else) that can handle lists of strings as input and can use WMD as a custom distance function?

Thanks

1 Answer


Clustering algorithms don't take strings (or lists-of-string-tokens) as input. Rather, you'll always have to preprocess your texts into some sort of numerical data, for use by a clustering algorithm.

But as the documentation for scikit-learn's clustering notes:

One important thing to note is that the algorithms implemented in this module can take different kinds of matrix as input. All the methods accept standard data matrices of shape [n_samples, n_features]. These can be obtained from the classes in the sklearn.feature_extraction module. For AffinityPropagation, SpectralClustering and DBSCAN one can also input similarity matrices of shape [n_samples, n_samples]. These can be obtained from the functions in the sklearn.metrics.pairwise module.

To further clarify what it's saying: if an algorithm takes input of shape [n_samples, n_features], it's assuming each of your texts has already been reduced to a fixed-size vector (of n_features size). A calculation like "Word Mover's Distance" doesn't inherently reduce a single text to a fixed-size vector – as you've noted, WMD only works on pairs of texts. So you'd not be able to directly use WMD to convert single texts, alone, into the n_features-sized vectors these algorithms require. (But also: see the 'final note' at bottom.)

On the other hand, if an algorithm can take input of shape [n_samples, n_samples], then it needs all possible pairwise similarities. That, you can pre-calculate using WMD, in a couple of loops over your texts. (The scikit-learn utility function pairwise_distances() would be the natural shortcut, but as the comments below note, it coerces its inputs to numeric arrays, so it won't accept lists of string tokens.) And note, these scikit-learn algorithms need affinity='precomputed' specified, and expect similarity values, where more-similar pairs return higher values, rather than distance values, where more-similar pairs return lower values. So you'd need to convert WMD distances to some sort of similarity scale. (One simple way that may suffice is to negate the distance value.)

For example, if texts includes all your (tokenized) texts, and wv your gensim word-vectors:

import numpy as np

# texts: your list of tokenized texts; wv: your gensim word-vectors (KeyedVectors)
dist_matrix = np.zeros((len(texts), len(texts)))
for i in range(len(texts)):
    for j in range(len(texts)):
        if i == j:
            continue  # self-distance stays 0.0 from np.zeros()
        if i > j:
            dist_matrix[i, j] = dist_matrix[j, i]  # re-use earlier calc: WMD is symmetric
            continue
        dist_matrix[i, j] = wv.wmdistance(texts[i], texts[j])
# this negation simply makes the largest dists into the smallest (most-negative) sims;
# you could also consider instead a calc that maps all dists to [-1.0, 1.0]
sim_matrix = -dist_matrix

That sim_matrix should work inside one of the scikit-learn algorithms that accepts the affinity='precomputed' option.
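For example, a minimal sketch with AffinityPropagation (the estimator choice and random_state are just illustrative; any of the algorithms accepting precomputed affinities could slot in):

from sklearn.cluster import AffinityPropagation

# affinity='precomputed' tells the estimator that sim_matrix already holds
# pairwise similarities, rather than per-text feature vectors
clusterer = AffinityPropagation(affinity='precomputed', random_state=0)
labels = clusterer.fit_predict(sim_matrix)  # one cluster label per text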

But beware: WMD is quite expensive to calculate (especially on longer texts), so depending on the size of your set of sentences, just precomputing all the pairwise distances may be very time-consuming.

A final note: you could also use the WMD calculations themselves to turn each individual text into a fixed-size vector, by representing each text as the vector of its distances (or similarities) to all the other texts.

That is, essentially run the same pairwise process as above, but then pass the resulting matrix into one of the algorithms that takes [n_samples, n_features]-shaped input.

I'm unsure if this will work as well as the kinds of input these algorithms were designed for, but if you've already spent the time calculating all pairwise distances (or similarities), it could be worth a try.
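A minimal sketch of that idea, assuming the sim_matrix from above (KMeans and n_clusters=5 are arbitrary illustrative choices):

from sklearn.cluster import KMeans

# each row of sim_matrix (length n_samples) serves as that text's feature vector,
# so the same [n_samples, n_samples] matrix is read as [n_samples, n_features] input
kmeans = KMeans(n_clusters=5, random_state=0)
labels = kmeans.fit_predict(sim_matrix)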

gojomo
  • Using your snippet of code I get the error: ValueError: setting an array element with a sequence. I think this is because the input has to be a nice (square) array, so I again get the same problem that not every sentence has the same number of words. So the issue boils down to the same problem I had with using [n_samples, n_features] as input. – Jari Peeperkorn Oct 25 '19 at 11:53
  • `pairwise_distances()` does not require a square-shaped input, but it will return a square-shaped result, so the most likely cause of your error is supplying an inappropriate value for `texts`. What did you put in the `texts` before trying the code I suggested? How was it created, how large is it? (It could be a list or array, but can't be an iterator/generator/non-list-sequence.) – gojomo Oct 25 '19 at 17:19
  • texts type is numpy.ndarray, texts[0] type is numpy.str_, texts[0][0] is str – Jari Peeperkorn Nov 06 '19 at 08:44
  • a bit late, but for me the sklearn implementation also does not work - sklearn executes 'check_paired_arrays' which tries to convert each element to float :-( – Sebastian Hätälä Jun 29 '20 at 20:44
  • @gojomo Thanks for your detailed answer, but unfortunately `pairwise_distances()` does not work with text data. Maybe mapping via a dictionary works? – Farhood ET Sep 05 '20 at 20:36
  • @FarhoodET - I thought that worked when written (since the supplied lambda-function didn't require strict arrays-of-floats) but maybe I was mistaken or `scipy` has become stricter recently. I've replaced that usage with an explicit set of nested loops to calculate all distances & create the same matrix (as per my other answer https://stackoverflow.com/a/44490916/130288). Beware: calculating WMD is quite a bit slower than simpler vector-to-vector comparisons, especially on longer texts. – gojomo Sep 08 '20 at 17:26