I have a list of phrases, each describing a feature of a house:
l = ["cats allowed", "dogs allowed", "pets allowed", "24 hour doorman", "24 hour concierge", "24/7 concierge", "24hr doorman", ...]
The list has about 20,000 phrases. I want to group similar phrases into clusters. Here, two clusters would be formed:
clstr1 = ["cats allowed", "dogs allowed", "pets allowed"]
clstr2 = ["24 hour doorman", "24 hour concierge", "24/7 concierge", "24hr doorman"]
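To make the target grouping concrete, here is a minimal runnable sketch on the sample phrases. It uses TF-IDF over character n-grams purely as a stand-in vectorizer (so the snippet runs without downloading any pre-trained model; the feature choice is an assumption, not my actual plan):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

phrases = [
    "cats allowed", "dogs allowed", "pets allowed",
    "24 hour doorman", "24 hour concierge",
    "24/7 concierge", "24hr doorman",
]

# Character n-grams (within word boundaries) catch spelling variants
# like "24hr" vs "24 hour" that whole-word features would miss.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(phrases)

# Here the number of clusters is known to be 2; on the full list it is not.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for phrase, label in zip(phrases, labels):
    print(label, phrase)
```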
I don't know the total number of clusters in advance. So far, the only approach I can see is the k-means clustering algorithm, but for that I have to vectorize the phrases. I am thinking of vectorizing them with the pre-trained Google word2vec model and then applying KMeans from scikit-learn with a guessed number of clusters, say n_clusters = 2000. Is there a better way to do this, using nltk or any other method?