I am trying to predict clusters for a set of test documents using a k-means model trained with scikit-learn.

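from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)  # learn vocabulary and IDF weights, then vectorize
k = 10
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)  # cluster the TF-IDF vectors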

The model is fitted with 10 clusters without any problem. But when I try to predict clusters for a list of documents, I get an error.

predicted_cluster = model.predict(test_documents)

Error message:

ValueError: could not convert string to float...

Do I need to use PCA to reduce the number of features, or do I need to do preprocessing for the text document?

work_in_progress

1 Answer

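model.predict() expects the same kind of numeric feature matrix the model was trained on, not raw strings, which is why you get the ValueError. You need to transform test_documents the same way the training documents were transformed. PCA is not needed to fix this error.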

X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)

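Make sure you call only transform() on the test documents, and use the same vectorizer object that was used for fit() or fit_transform() on the training documents. Calling fit_transform() again, or using a new vectorizer, would learn a different vocabulary, so the test matrix would no longer have the columns the model expects.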
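
For reference, here is a minimal end-to-end sketch of the steps above. The two small corpora, n_clusters=2, and random_state=0 are assumptions made only so the snippet runs on its own; they are not from the question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the asker's train/test documents.
train_documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
test_documents = ["my cat chased the dog", "shares and stock prices"]

# Learn the vocabulary and IDF weights from the training documents only.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)

# n_clusters=2 and random_state=0 are assumptions for this tiny example.
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1,
               random_state=0)
model.fit(X)

# Reuse the fitted vectorizer: transform() only, no refitting.
X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)

print(X.shape, X_test.shape)   # both matrices have the same number of columns
print(predicted_cluster)       # one cluster label per test document
```

Words in the test documents that never appeared in the training documents are simply ignored by transform(), so the test matrix keeps the training vocabulary's column layout.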

Vivek Kumar
  • Or rather X_test = vectorizer.transform(test_documents) – pgrenholm Apr 22 '17 at 05:59
  • @pgrenholm yes. Corrected. Thanks. Even though I specifically stated not to do that, it seems that I myself made that mistake – Vivek Kumar Apr 22 '17 at 06:04
  • Yes, it worked for me. Thanks a lot. I then got an "incorrect number of features" error, but the following post came to the rescue for that: http://stackoverflow.com/a/26943563/1269131 – work_in_progress Apr 22 '17 at 06:16
  • @SiMemon Ok. I thought that it would be understood because I used the same name in my code. But I should have explicitly mentioned using the same object. – Vivek Kumar Apr 22 '17 at 06:23
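
The "incorrect number of features" error mentioned in the comments is what typically happens when the test documents are vectorized with a freshly fitted vectorizer instead of the original one. A small sketch of the difference, using hypothetical toy documents that are not from the thread:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, for illustration only.
train_documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
test_documents = ["my cat chased the dog", "shares and stock prices"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)

# Wrong: a new vectorizer fitted on the test documents learns its own vocabulary,
# so the column count (and column meaning) differs from the training matrix.
X_wrong = TfidfVectorizer(stop_words='english').fit_transform(test_documents)

# Right: the already-fitted vectorizer maps test documents onto the training columns.
X_test = vectorizer.transform(test_documents)

print("train features:        ", X.shape[1])
print("refit on test (wrong): ", X_wrong.shape[1])
print("same vectorizer (ok):  ", X_test.shape[1])
```

Passing X_wrong to a model fitted on X is what triggers the feature-count mismatch; X_test has the matching shape.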