ValueError while predicting a document in a scikit-learn k-means cluster

Question

I am trying to predict the cluster for a set of test documents using a k-means model trained with scikit-learn.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)
k = 10
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

The model is fitted with 10 clusters without any problem. But when I try to predict clusters for a list of documents, I get an error.

predicted_cluster = model.predict(test_documents)

Error message:

ValueError: could not convert string to float...

Do I need to use PCA to reduce the number of features, or do I need to preprocess the text documents?

Vivek Kumar · Accepted Answer · 2017-04-22T06:25:39.420

1

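You need to transform test_documents the same way the training documents were transformed. model.predict() expects the numeric TF-IDF matrix, not raw strings, which is why it fails with "could not convert string to float".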

X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)

Make sure you call only transform() on the test documents, and use the same vectorizer object that was fitted with fit() or fit_transform() on the training documents.
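For completeness, here is a minimal end-to-end sketch of the fix. The two toy corpora and the choice of 2 clusters are assumptions made up for illustration (the question uses k = 10 on its own data), and random_state is added only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for train_documents / test_documents.
train_documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
test_documents = ["my cat chased the dog", "shares and stock prices"]

# Fit the vectorizer on the training documents only:
# this learns the vocabulary and the IDF weights.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_documents)

# 2 clusters here because the toy corpus is tiny (assumption, not from the question).
model = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=1, random_state=0)
model.fit(X)

# Reuse the already fitted vectorizer: transform() only, no refitting.
X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)

print(X.shape, X_test.shape)  # both matrices have the same number of columns
print(predicted_cluster)      # one cluster index per test document
```

Because X_test is built from the vocabulary learned on the training documents, it has the same columns as X, so predict() can compare each test document against the cluster centers.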

edited Apr 22 '17 at 06:25

answered Apr 22 '17 at 05:56


Or rather X_test = vectorizer.transform(test_documents) – pgrenholm Apr 22 '17 at 05:59
@pgrenholm yes. Corrected. Thanks. Even though I specifically stated not to do that, it seems that I myself made that mistake – Vivek Kumar Apr 22 '17 at 06:04
Yes, it worked for me. Thanks a lot. I then got an "incorrect number of features" error, but the following post came to the rescue: http://stackoverflow.com/a/26943563/1269131 – work_in_progress Apr 22 '17 at 06:16
@SiMemon OK. I thought it would be understood because I used the same name in my code, but I should have explicitly mentioned using the same object. – Vivek Kumar Apr 22 '17 at 06:23
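One likely cause of the "incorrect number of features" error mentioned in the comments is fitting a second vectorizer (or calling fit_transform() again) on the test documents. The sketch below illustrates this under assumptions: the toy corpora, the 2 clusters, and the name wrong_vectorizer are hypothetical, and the exact wording of the error message depends on the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, for illustration only.
train_documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
test_documents = ["my cat chased the dog", "shares and stock prices"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_documents)
model = KMeans(n_clusters=2, n_init=1, random_state=0).fit(X)

# Mistake: a new vectorizer fitted on the test documents learns a
# different vocabulary, so its matrix has a different number of columns.
wrong_vectorizer = TfidfVectorizer(stop_words="english")
X_wrong = wrong_vectorizer.fit_transform(test_documents)

mismatch_raised = False
try:
    model.predict(X_wrong)
except ValueError as exc:
    mismatch_raised = True
    print("predict() rejected the input:", exc)

# Correct: the vectorizer fitted on the training data keeps the columns aligned.
X_ok = vectorizer.transform(test_documents)
print(X.shape[1], X_wrong.shape[1], X_ok.shape[1])
```

Even if the two vocabularies happened to have the same size, the columns would refer to different words, so the predictions would be meaningless; reusing the fitted vectorizer avoids both problems.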
