I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine, getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of
documents_lst = ['a small, narrow river',
'a continuous flow of liquid, air, or gas',
'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
'a group in which schoolchildren of the same age and ability are taught',
'(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
'put (schoolchildren) in groups of the same age and ability to be taught together',
'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the defintions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training and predict have a different amount of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance