
I have a list of phrases that describe features of a house.

l = ["cats allowed", "dogs allowed", "pets allowed", "24 hour doorman", "24 hour concierge", "24/7 concierge", "24hr doorman", ...]

The list has about 20,000 phrases. I want to group similar phrases into clusters. For the example above, two clusters would be formed:

clstr1 = ["cats allowed", "dogs allowed", "pets allowed"]

clstr2 = ["24 hour doorman", "24 hour concierge", "24/7 concierge", "24hr doorman"]

I don't know the total number of clusters in advance. So far, the only approach I can see is the k-means clustering algorithm, but for that I have to vectorize the phrases first. I am thinking of vectorizing them using the pre-trained Google word2vec model and then applying KMeans from scikit-learn with a guessed number of clusters, possibly n_clusters = 2000. Is there a better way, using NLTK or some other method?
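A minimal sketch of the approach described above: represent each phrase as the average of its word vectors, then cluster with scikit-learn's `KMeans`. The tiny 2-d embedding dictionary below is purely illustrative and stands in for the real Google word2vec model (which in practice you would load with gensim's `KeyedVectors.load_word2vec_format`); the vector values are made up so that pet-related and doorman-related words land in different regions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for pre-trained word2vec vectors.
vectors = {
    "cats": np.array([1.0, 0.1]), "dogs": np.array([0.9, 0.2]),
    "pets": np.array([1.1, 0.0]), "allowed": np.array([1.0, 0.3]),
    "24": np.array([0.0, 1.0]), "hour": np.array([0.1, 1.1]),
    "24/7": np.array([0.0, 0.9]), "24hr": np.array([0.1, 1.0]),
    "doorman": np.array([0.2, 1.2]), "concierge": np.array([0.1, 0.8]),
}

def phrase_vector(phrase):
    """Average the vectors of the phrase's words, skipping unknown words."""
    words = [vectors[w] for w in phrase.split() if w in vectors]
    return np.mean(words, axis=0)

phrases = ["cats allowed", "dogs allowed", "pets allowed",
           "24 hour doorman", "24 hour concierge", "24/7 concierge",
           "24hr doorman"]

X = np.vstack([phrase_vector(p) for p in phrases])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for p, lab in zip(phrases, labels):
    print(lab, p)
```

With real data, `n_clusters` would have to be guessed (or a method that does not need a preset cluster count, such as DBSCAN or agglomerative clustering with a distance threshold, could be tried instead).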

Abhinav Gupta
    It seems like you are searching for clusters in terms of word meanings / semantics? In that case you need some way to compute distances between words in terms of their semantics. [WordNet](https://wordnet.princeton.edu/) might be a candidate to get these distances. Also, you need to filter out uninteresting terms like a, the, it etc. – languitar Mar 13 '17 at 13:39
  • Yes, but can you suggest a tutorial using WordNet? I am new to the field and have only used the WordNet lemmatizer once. Thanks for this. – Abhinav Gupta Mar 13 '17 at 13:47
  • Google just found [this](http://www.nltk.org/howto/wordnet.html). Have a look for `Bug 470` to find out how to compute distances. You then need to pass this distance function into kmeans, which seems to be possible with nltk: https://stackoverflow.com/questions/5529625/is-it-possible-to-specify-your-own-distance-function-using-scikit-learn-k-means – languitar Mar 13 '17 at 13:54
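Following the suggestion in the last comment, NLTK's `KMeansClusterer` accepts an arbitrary distance function, unlike scikit-learn's `KMeans`, which is hard-wired to Euclidean distance. A minimal sketch with cosine distance (the phrase vectors here are hypothetical 2-d stand-ins for averaged word2vec vectors, split into two loose groups):

```python
import numpy as np
from nltk.cluster import KMeansClusterer, cosine_distance

# Hypothetical phrase vectors: two loose groups in 2-d space.
X = [np.array(v) for v in
     [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0],   # pet-related phrases
      [0.1, 1.0], [0.0, 0.9], [0.2, 1.1]]]  # doorman-related phrases

# NLTK's k-means takes any distance function, e.g. cosine_distance;
# a WordNet-based distance could be plugged in the same way.
clusterer = KMeansClusterer(2, distance=cosine_distance,
                            repeats=10, avoid_empty_clusters=True)
assignments = clusterer.cluster(X, assign_clusters=True)
print(assignments)
```

The same `distance=` slot is where a WordNet similarity measure (inverted into a distance) could go, as the linked question discusses.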

0 Answers