I recently started working on document clustering using the scikit-learn module in Python. However, I am having a hard time understanding the basics of document clustering.
What I know:
- Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then fed to the clustering algorithm.
- There are many algorithms, such as k-means, neural networks, and hierarchical clustering, that can accomplish this.
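To check my understanding of the TF-IDF step, here is a minimal pure-Python sketch of the textbook weighting (term frequency times the log of inverse document frequency). The helper names `tfidf_vectors` and `cosine_similarity` and the three sample summaries are made up for illustration. As far as I can tell, scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers would differ from these.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn text documents into TF-IDF vectors (plain textbook variant)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary defines the dimensions of the vector space.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)  # avoid dividing by zero on empty docs
        vectors.append([
            (counts[word] / length) * math.log(n_docs / df[word])
            for word in vocab
        ])
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical profile summaries, invented for this example.
docs = [
    "software engineer python developer",
    "python developer backend engineer",
    "registered nurse hospital care",
]
vocab, vectors = tfidf_vectors(docs)
sim_jobs = cosine_similarity(vectors[0], vectors[1])   # two developer profiles
sim_other = cosine_similarity(vectors[0], vectors[2])  # developer vs. nurse
```

Here the two developer summaries share weighted terms and get a positive similarity, while the nurse summary shares no words with the first one and scores zero. That difference is what I am hoping a clustering algorithm can exploit.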
My data:
- I am experimenting with LinkedIn data. Each document is a LinkedIn profile summary, and I would like to see whether documents describing similar jobs get clustered together.
Current Challenges:
- My data has huge summary descriptions, which end up producing tens of thousands of words (dimensions) when I apply TF-IDF. Is there a proper way to handle such high-dimensional data?
- K-means and other algorithms require that I specify the number of clusters (centroids), but in my case I do not know the number of clusters up front. I believe this is a completely unsupervised learning problem. Are there algorithms that can determine the number of clusters themselves?
- I've never worked with document clustering before. If you are aware of tutorials, textbooks, or articles that address this issue, please feel free to suggest them.
I went through the code on the scikit-learn website, but it uses too many technical terms that I do not understand. If you have any code with good explanations or comments, please share it. Thanks in advance.