
I realized that all of the MONK's problems have a test set larger than their train set. Why is the dataset organized like this? It seems strange to me, even for a dummy dataset meant for model comparison.

Monk1 Train samples: 124 Test samples: 432

Monk2 Train samples: 169 Test samples: 432

Monk3 Train samples: 122 Test samples: 432
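
For reference, a minimal sketch that reproduces these counts from the UCI repository; the URLs and the whitespace-separated column layout are assumptions based on the standard monks-problems directory:

```python
# Minimal sketch reproducing the counts above. The URLs and the
# whitespace-separated column layout are assumptions based on the
# standard UCI "monks-problems" directory.
import pandas as pd

BASE = "https://archive.ics.uci.edu/ml/machine-learning-databases/monks-problems"
COLS = ["class", "a1", "a2", "a3", "a4", "a5", "a6", "id"]

for i in (1, 2, 3):
    train = pd.read_csv(f"{BASE}/monks-{i}.train", sep=r"\s+", names=COLS)
    test = pd.read_csv(f"{BASE}/monks-{i}.test", sep=r"\s+", names=COLS)
    print(f"Monk{i} Train samples: {len(train)} Test samples: {len(test)}")
```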


1 Answer


From the machine learning point of view, it absolutely doesn't matter how big the test set is. Why does it bother you? The real world looks exactly the same way: you have N labeled samples for training, but there are N*10, N*1000, N*10^9 or more real cases out there, so any (manually labeled, fixed) test set will necessarily be too small. The goal is to have a representative test set, covering everything we expect to see in the real world, and if that means having a YUGE™ test set, then the best thing you can do is to have a test set larger than the training set.
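
To put a rough number on "representative": the standard error of an accuracy estimate from n test samples is about sqrt(p(1-p)/n), so the estimate tightens as the test set grows. A back-of-the-envelope sketch (the assumed accuracy of 0.9 is purely illustrative):

```python
# Back-of-the-envelope: how test set size affects the uncertainty of an
# accuracy estimate. Standard error of a proportion: sqrt(p*(1-p)/n).
from math import sqrt

p = 0.9  # assumed true accuracy -- purely illustrative
for n in (124, 432, 4320):
    half_width = 1.96 * sqrt(p * (1 - p) / n)  # 95% CI half-width
    print(f"n={n:5d}: accuracy {p:.2f} +/- {half_width:.3f}")
```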

In this particular case (and I'm not familiar with this task), the website you cited reads:

There are three MONK's problems. The domains for all MONK's problems are the same (described below). One of the MONK's problems has noise added. For each problem, the domain has been partitioned into a train and test set.

The paper cited below

Wnek, J. and Michalski, R.S., "Comparing Symbolic and Subsymbolic Learning: Three Studies," in Machine Learning: A Multistrategy Approach, Vol. 4., R.S. Michalski and G. Tecuci (Eds.), Morgan Kaufmann, San Mateo, CA, 1993.

on page 20 reads as follows:

[Image: excerpt from page 20 of the paper, describing how the three training sets were selected.]

So in this particular scenario, the authors chose different training conditions, hence the three different training sets. According to

Leondes, Cornelius T., Image Processing and Pattern Recognition, Vol. 5, Elsevier, 1998, p. 307

they used all 432 available samples as the test set and trained on a subset of this data.
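
That construction is easy to reproduce: the test set is the full Cartesian product of the six attribute ranges (cardinalities taken from the UCI dataset description), and a training set like Monk1's is just a subset drawn from it. A minimal sketch:

```python
# The MONK's domain is the Cartesian product of the six attribute ranges
# (cardinalities from the UCI dataset description): 3*3*2*3*4*2 = 432.
from itertools import product
import random

domain = list(product(range(1, 4), range(1, 4), range(1, 3),
                      range(1, 4), range(1, 5), range(1, 3)))
print(len(domain))  # 432 -- the size of every MONK test set

# A training set like Monk1's is then just a subset of this domain.
train = random.sample(domain, 124)
print(set(train) <= set(domain))  # True: train is fully contained in test
```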

Having an overlap between training and test data is considered bad practice, but who am I to judge research from 25 years ago in a field I'm not familiar with. Maybe it was too difficult to obtain more data and make a clean split.
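
You can also verify that overlap directly; this sketch (same URL and format assumptions as the sketch in the question) checks how many Monk1 training samples also appear in the test file:

```python
# Check the train/test overlap directly: every Monk1 training sample
# should also appear in the 432-sample test set. URLs and column layout
# are the same assumptions as in the question's sketch.
import pandas as pd

BASE = "https://archive.ics.uci.edu/ml/machine-learning-databases/monks-problems"
COLS = ["class", "a1", "a2", "a3", "a4", "a5", "a6", "id"]
ATTRS = ["a1", "a2", "a3", "a4", "a5", "a6"]

train = pd.read_csv(f"{BASE}/monks-1.train", sep=r"\s+", names=COLS)
test = pd.read_csv(f"{BASE}/monks-1.test", sep=r"\s+", names=COLS)

train_rows = set(map(tuple, train[ATTRS].values))
test_rows = set(map(tuple, test[ATTRS].values))
print(f"Overlap: {len(train_rows & test_rows)} of {len(train_rows)} training samples")
```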

  • From the machine learning point of view, it absolutely *does* matter how big the test set is because the amount of available labeled data is limited and is split between training, validation, and test sets. In practice, every sample that is reserved for testing is one less sample to be used for training. It is common practice to use an [80/20 training/test split](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio). So it is natural to wonder why less than 30% of the data are used for training here. – bogatron Jan 28 '18 at 16:37
  • This is a question of data management and practical considerations that are specific to a particular task, not an algorithmic constraint or a theoretical requirement. I agree that from today's perspective, it appears to be a strange choice and I would not recommend such a split in any new task design. – Pavel Jan 28 '18 at 17:32