What is the purpose of test data? Is it only to calculate the accuracy of the classifier? I'm using Naive Bayes for sentiment analysis of tweets. Once I train my classifier using training data, I use the test data just to calculate the accuracy of the classifier. How can I use the test data to improve the classifier's performance?
2 Answers
In general supervised machine learning, the test data set plays a critical role in determining how well your model is performing. You typically build a model with, say, 90% of your input data, leaving 10% aside for testing. You then check the accuracy of that model by seeing how well it does against the 10% test set. The performance of the model against the test data is meaningful because the model has never "seen" this data. If the model is statistically valid, it should perform well on both the training and test data sets. This general procedure is called cross validation and you can read more about it here.
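For readers who want to see the hold-out procedure in code: below is a minimal scikit-learn sketch using Naive Bayes on a toy set of tweets, to match the question. The toy data, the split size, and the `CountVectorizer` bag-of-words features are illustrative assumptions, not part of the original answer.

```python
# Minimal sketch: hold out a test split, train Naive Bayes on the rest,
# and measure accuracy only on the held-out part. Toy data, illustrative split.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

tweets = ["love this phone", "worst service ever", "great day today",
          "so disappointed with this", "really happy customer", "never buying again"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]  # toy sentiment labels

# Split the raw text first so the test tweets stay completely unseen.
train_txt, test_txt, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=0)

# Fit the bag-of-words vocabulary on the training tweets only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_txt)
X_test = vectorizer.transform(test_txt)

model = MultinomialNB().fit(X_train, y_train)

# The score below is meaningful because the model never saw the test tweets.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```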

- Are you dividing your set into train-set+dev-test set like [this](http://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification)? – CSK Mar 02 '15 at 06:29
- I don't have any experience working with Naive Bayes, but I have worked extensively with decision trees (and a bit of SVM). The article to which you referred looks spot on for what you are doing. – Tim Biegeleisen Mar 02 '15 at 06:44
You don't -- like you surmise, the test data is used for testing, and mustn't be used for anything else, lest you skew your accuracy measurements. This is an important cornerstone of any machine learning -- you only fool yourself if you use your test data for training.
If you are considering desperate measures like that, the proper way forward is usually to re-examine your problem space and the solution you have. Does it adequately model the problem you are trying to solve? If not, can you devise a better model which captures the essence of the problem?
Machine learning is not a silver bullet. It will not solve your problem for you. Too many failed experiments prove over and over again, "garbage in -- garbage out".
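As a concrete (hypothetical) illustration of keeping the test data out of everything except the final measurement: one common pattern is to do all tuning and model comparison via cross-validation inside the training portion, and to score the held-out test set exactly once at the end. The pipeline, toy tweets, split, and fold count below are assumptions for the sketch, not something this answer prescribes.

```python
# Sketch only: cross-validate inside the training data while the test set
# stays untouched until one final evaluation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["love it", "great phone", "so happy", "works perfectly", "really nice",
          "hate it", "total waste", "so disappointed", "keeps crashing", "awful support"]
labels = ["pos"] * 5 + ["neg"] * 5  # toy sentiment labels

train_txt, test_txt, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=0)

model = make_pipeline(CountVectorizer(), MultinomialNB())

# All tuning and comparison happens here, using only the training portion.
cv_scores = cross_val_score(model, train_txt, y_train, cv=4)
print("cross-validation accuracy:", cv_scores.mean())

# Look at the test set exactly once, after the model choice is settled.
model.fit(train_txt, y_train)
print("final test accuracy:", model.score(test_txt, y_test))
```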

- So, increasing my training set is the only way to improve performance of my classifier? – CSK Mar 02 '15 at 05:48
- A better model is often the only way to get substantial improvement, which seems to be what you are asking about. Use better features, or if you are lucky, drop noise features; or, try a completely different approach. – tripleee Mar 02 '15 at 05:50 (a feature sketch follows these comments)
- I tried stripping nouns and symbols off the tweets but the results are not that good. If you don't mind, can you please suggest any other improvements to get better features? – CSK Mar 02 '15 at 06:46
- Perhaps you could update your question with some background on the problem involving your Twitter data. That might help with a more directed answer. – Tim Biegeleisen Mar 02 '15 at 06:54
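Picking up the comment thread's point about better features, here is a rough sketch of how one might compare feature sets for tweet sentiment, e.g. stripping URLs and @handles and trying TF-IDF with bigrams alongside plain counts. The cleaning regex, the vectorizer settings, and the toy data are illustrative guesses, not recommendations from the answerers.

```python
# Sketch: compare candidate feature sets with cross-validation on training
# data only; the held-out test set is not involved at this stage.
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean_tweet(text):
    # Drop URLs and @mentions, which rarely carry sentiment.
    return re.sub(r"https?://\S+|@\w+", " ", text.lower())

tweets = ["@shop love this phone http://t.co/x", "worst service ever @support",
          "great day today", "so disappointed with this",
          "really happy customer", "never buying again"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]  # toy sentiment labels

candidates = {
    "unigram counts": CountVectorizer(preprocessor=clean_tweet),
    "tf-idf 1-2 grams": TfidfVectorizer(preprocessor=clean_tweet, ngram_range=(1, 2)),
}

for name, vec in candidates.items():
    scores = cross_val_score(make_pipeline(vec, MultinomialNB()), tweets, labels, cv=3)
    print(name, "accuracy:", scores.mean())
```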