3

I've been working with Weka for awhile now, and in my research on it, I find that a lot of code examples use test and training sets. For instance, with Discretization and Bayesian Networks,their examples are almost always shown using test and training sets. I may be missing some fundamental understanding of data processing here, but I don't understand why this seems to always be the case. I am using Discretization and Bayesian Networks in a project and for both of them, I have not used test or training sets, and do not see why I would need to either. I am performing cross validation on the BayesNet, so I am testing its accuracy. Am I misunderstanding what test and training sets are used for??? Oh and please use the simplest of terminology; I'm still not very experienced with the world of data processing.

Ketchy108
  • 61
  • 1
  • 2
  • 5

2 Answers2

5

The idea behind training and test sets is to test the generalization error. That is, if you used just one data set, you could achieve perfect accuracy by simply learning this set (this is what nearest neighbour classifiers do, IBk in Weka). In general, this is not what you want however -- the machine learning algorithm should learn the general concept behind the example data that you give it. A way of testing whether this happens is to use separate data for training and testing.

If you're using cross-validation, you're using separate training and test sets. This is simply a way of coming up with the partition of your entire data set into training and test. If you do 10 fold cross-validation for example, your entire data is partitioned into 10 sets of equal size. Nine of these are combined and used for training, the remaining one for testing. Then the process is repeated with nine different sets combined for training and so on until all the ten individual partitions will have been used for testing.

So training/test sets and cross-validation are conceptually doing the same thing, cross-validation simply takes a more rigorous approach by averaging over the entire data set.

Lars Kotthoff
  • 107,425
  • 16
  • 204
  • 204
  • quite old question but got a question regarding this. So if I have training, development and test sets, what is the role of development set to WEKA? How do I use my development set in WEKA? – samsamara May 26 '14 at 09:09
  • Not sure what you mean by "development set". You would usually have training, test and verification sets. – Lars Kotthoff May 26 '14 at 09:21
1

Training data refers to the data used to "build the model". For example, it you are using the algorithm J48 (a tree classifier) to classify instances, the training data will be used to generate the tree that will represent the "learned concept" that should be a generalization of the concept. It means that the learned rules, generated trees, the adjusted neural network, or whatever; will be able to get new (unseen) instances and classify them correctly (the "learned concept" does not depends on the training data).

The test sets are a percentage of the data that will be used to test whether the model has learned the concept properly (it is independent of the training data).

In WEKA you can run an execution splitting your data set into trainig data (to build the tree in the case of J48) and test data (to test the model in order to determine that the concept has been learned). For example, you can use 60% of the data for training and 40% for testing (determine how much data is needed for training and testing is one of the key problems of data mining).

But I would recommend you to have a quick look to cross-validation, that is a robust testing method that is implemented in WEKA. It has been explained quite well here: https://stackoverflow.com/a/10539247/1565171

If you have more questions just leave a comment.

Community
  • 1
  • 1
arutaku
  • 5,937
  • 1
  • 24
  • 38