
I am trying to implement something similar to @ben's code here in R. I am working with unstructured news articles and want to cluster them after doing topic modeling. I executed the code provided by @ben and it worked. I would like to know how to split the data into train and test sets, predict the clusters for the test data, and then evaluate how the test data was clustered, perhaps using mean average precision.

I know this becomes semi-supervised rather than fully unsupervised, but I want to try it to see the results.

Karan Kothari
  • Advice on how to methodically set this up is best received on [stats.SE]. For here, you need to provide some code of what you've tried (an implementation) and point out where precisely you have trouble with the implementation of your method (which requires having a method after all). Please extend the question in such a way that it becomes eligible for either of the sites and request migration (via the "flag" menu) if necessary. – AlexR Aug 13 '16 at 13:42
  • Ok.. Thanks I'll repost it there – Karan Kothari Aug 13 '16 at 13:59
  • 1
    I've started a migration request. This way this question will be moved to CV without a duplicate being created. – AlexR Aug 13 '16 at 14:01
  • Thanks @AlexR But can you help me with this question? – Karan Kothari Aug 13 '16 at 14:11
  • I can when I have the time. Likely that will be some time tomorrow if noone else has answered until then. – AlexR Aug 13 '16 at 14:12
  • Ok.. I will wait for it then.. Thanks @AlexR but was the question clear enough on what I want to do? – Karan Kothari Aug 13 '16 at 14:13
  • I'll comment if I need further clarification - as I said, no time now ;) – AlexR Aug 13 '16 at 14:15
  • 1
    I'm not sure this question will survive on [stats.SE] in its current form. It needs to be clearer, self-contained, & not code review or about how to implement a given procedure in software. If the only real question is how to divide data into train & test, & how to assess the validity of a clustering thereby, that would be a good question, but a duplicate. You should search the site & read the existing information. Then you could post a question that is specific to what you still need to know. – gung - Reinstate Monica Aug 13 '16 at 16:42
  • @gung can you provide the link to the question you are saying is duplicate? It might solve my doubt and question.. Thanks – Karan Kothari Aug 14 '16 at 00:46
  • There are lots of threads on splitting data. Try working through [this search](http://stats.stackexchange.com/search?tab=relevance&q=split%20data%20train%20test%20is%3aquestion). There are also several threads on cross validating clusterings. Read through some of [these](http://stats.stackexchange.com/questions/tagged/cross-validation+clustering?sort=votes&pageSize=30). If you still have a question after that, you can formulate a clear, concrete question, & state what you've learned & what you still need to know. – gung - Reinstate Monica Aug 14 '16 at 01:38

1 Answer


Semi-supervised means that you'd optimize (!) the clustering to produce the "optimum" results on the data where you have labels, and expect it to then also cluster the unlabeled data well. This is hard to get working, depending on your data. For example, with k-means you would likely optimize k to match the number of known clusters, but what about the not-yet-known clusters?
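To make the k-means caveat concrete, here is a minimal stdlib sketch (in Python rather than R, purely for illustration; the data and labels are made up): k is chosen from the labeled subset, so a topic that only appears in the unlabeled data has no cluster of its own and gets merged into an existing one.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D k-means (Lloyd's algorithm), stdlib only."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[idx].append(p)
        # Recompute centers; keep the old center if a group empties.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

# Labeled subset with two known topics -> k is fixed from the label count.
labeled = [(0.1, "sports"), (0.2, "sports"), (5.0, "politics"), (5.2, "politics")]
k = len({label for _, label in labeled})  # k = 2

# The unlabeled data contains a third, not-yet-known topic near 10.0;
# with k fixed to 2 it cannot get its own cluster.
unlabeled = [0.15, 5.1, 9.9, 10.1]
centers = kmeans_1d([p for p, _ in labeled] + unlabeled, k)
```

The same issue arises whatever clustering algorithm you tune on the labeled portion: parameters optimized for the known labels say nothing about structure that only exists in the unlabeled data.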

If you just want to see how well your clustering method works, you do not need a train-test split. That serves the purpose of avoiding overfitting when optimizing parameters (and, to that extent, of avoiding being overly optimistic about your real performance). When not using the labels in the method (as in clustering) and also not using them for parameterization, you can simply perform what is called "external evaluation". You re-add the labels to your data set and evaluate how well the clustering agrees with your labels.
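A common external measure is purity: the fraction of points whose true label matches the majority label of their cluster. A stdlib Python sketch (toy cluster ids and labels, not from the question's data):

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of points whose label matches the majority label
    of their cluster; clusters and labels are parallel lists."""
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(l)
    # Sum the size of the majority label in each cluster.
    hits = sum(Counter(members).most_common(1)[0][1]
               for members in by_cluster.values())
    return hits / len(labels)

clusters = [0, 0, 1, 1, 2, 2]
labels   = ["olympics", "olympics", "olympics", "olympics",
            "economy", "economy"]
score = purity(clusters, labels)
```

Note that purity never penalizes splitting a label across several clusters; other external measures (Rand index, adjusted mutual information) weigh such splits differently, so pick the measure to match what "agreement" should mean for your task.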

But beware: clusters can be good even if they do not agree with your labels. For example, your label might be "olympics", but the clustering produces a cluster for "swimming". It's a good cluster, even if it splits up your provided label (one may even argue that it is good because it does so; it improves your label!).
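This is also where the choice of external measure matters. A pair-counting Rand index, for instance, does penalize splitting the "olympics" label into two clusters, even though each sub-cluster may be perfectly coherent. A stdlib sketch with made-up data:

```python
from itertools import combinations

def rand_index(clusters, labels):
    """Pair-counting Rand index: fraction of point pairs on which the
    clustering and the labels agree (same/same or different/different)."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (labels[i] == labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# "olympics" split into two clusters (say, swimming vs. athletics):
clusters = [0, 0, 1, 1, 2, 2]
labels   = ["olympics", "olympics", "olympics", "olympics",
            "economy", "economy"]
score = rand_index(clusters, labels)
```

Here the split costs agreement on the olympics-olympics pairs that land in different clusters, so the score drops below 1 even though no point is mixed with the wrong topic. Low external scores therefore need inspection, not automatic rejection.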

If all your data is labeled, always prefer classification! Don't attempt to optimize clustering to simulate classification.

Has QUIT--Anony-Mousse