
I'm following this scikit-learn tutorial on text clustering using K-Means: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

In the example, optionally LSA (using SVD) is used to perform dimensionality reduction.

Why is this useful? The number of dimensions (features) can already be controlled in the TF-IDF vectorizer using the "max_features" parameter.

I understand that LSA (and LDA) are also topic modelling techniques. The difference from clustering is that a document can belong to multiple topics, but only to one cluster. What I don't understand is why LSA would be used in the context of K-Means clustering.

Example code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

documents = ["some text", "some other text", "more text"]  # placeholder list; the parameters below assume a much larger corpus

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

svd = TruncatedSVD(1000)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
Xnew = lsa.fit_transform(X)

model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, verbose=False)
model.fit(Xnew)
Niko Nelissen

2 Answers


LSA transforms the bag-of-words feature space into a new feature space (with an orthonormal set of basis vectors) where each dimension represents a latent concept, expressed as a linear combination of the words in the original space.

As with PCA, a few top eigenvectors generally capture most of the variance in the transformed feature space, while the remaining ones mainly represent noise in the dataset. The top eigenvectors in the LSA feature space can therefore be expected to capture most of the concepts defined by the words in the original space.

Hence, reducing dimensionality in the transformed LSA feature space is likely to be much more effective than in the original BOW tf-idf feature space (where reduction simply chops off the less frequent / unimportant words), leading to better-quality data after dimensionality reduction and, in turn, to better-quality clusters.

Additionally, dimensionality reduction helps to fight the curse of dimensionality (which arises, for example, in the distance computations of k-means).
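
To make this concrete, here is a minimal sketch (toy corpus and arbitrary parameter choices of my own, not from the question) that fits a TruncatedSVD on a tf-idf matrix and inspects how much variance the top LSA components capture:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# toy corpus; a real corpus would have far more documents and terms
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the dog chased the cat",
    "stock prices fell sharply today",
    "investors sold shares amid market fears",
    "the market rallied after the report",
    "pets need food and care",
    "share prices rose in early trading",
]

X = TfidfVectorizer(stop_words='english').fit_transform(docs)  # sparse tf-idf matrix

svd = TruncatedSVD(n_components=5, random_state=0)
svd.fit(X)

# cumulative variance captured by the top LSA components; on real data a
# relatively small number of components typically explains most of the variance
print(svd.explained_variance_ratio_.cumsum())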

Sandipan Dey

There is a paper that shows that PCA eigenvectors are good initializers for K-Means.

Controlling the dimension with the max_features parameter is equivalent to cutting off the vocabulary size, which has negative effects. For example, if you set max_features to 10, the model will work only with the 10 most common words in the corpus and ignore the rest.
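
As a rough illustration (toy corpus and made-up parameter values, not from the original example): max_features simply truncates the vocabulary, while TruncatedSVD keeps every term and projects the documents onto a few latent directions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["some text about cats", "some other text about dogs",
        "more text about cats and dogs", "finance text about stocks"]

# option 1: cut the vocabulary -- rare terms are dropped entirely
small_vec = TfidfVectorizer(max_features=5)
X_small = small_vec.fit_transform(docs)
print(sorted(small_vec.vocabulary_))   # only the 5 most frequent terms survive

# option 2: keep the full vocabulary and reduce with LSA
full_vec = TfidfVectorizer()
X_full = full_vec.fit_transform(docs)
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_full)
print(X_lsa.shape)   # each document is now a 2-dimensional "concept" vector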

elyase
  • Thanks @elyase, this helps. So using LSA (SVD) will result in better clusters compared to simply using max_features=10 in TFIDF. Is LSA (SVD) similar to PCA, or how should I see this? What is the relation between LSA and PCA? – Niko Nelissen Feb 22 '17 at 16:24
  • I guess you can find an answer here: http://stats.stackexchange.com/questions/65699/lsa-vs-pca-document-clustering. The two differ a little in how they do the dimensionality reduction. PCA computes the covariance matrix of your input array, whereas SVD (or LSA) uses scipy to compute the decomposition X = U * S * V.T directly. For scikit-learn specifically, you can't feed a sparse matrix to the PCA model, so if you have a tf-idf matrix, using SVD might be a better choice. – titipata Feb 22 '17 at 16:52
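
A small sketch of the practical point in the last comment (the toy documents and parameters here are my own): scikit-learn's TruncatedSVD accepts the sparse tf-idf matrix directly, whereas PCA expects a dense array.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD

docs = ["some text", "some other text", "more text", "yet more text"]
X = TfidfVectorizer().fit_transform(docs)                # scipy sparse matrix

X_svd = TruncatedSVD(n_components=2).fit_transform(X)    # works on sparse input

# PCA needs a dense array; calling .toarray() on a large document-term matrix
# can be prohibitively expensive, which is why TruncatedSVD is the usual
# choice for tf-idf data
X_pca = PCA(n_components=2).fit_transform(X.toarray())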