From a machine learning point of view, it absolutely doesn't matter how big the test set is relative to the training set. Why does it bother you? The real world looks exactly the same way: you have N labeled samples for training, but there are N*10, N*1000, N*10^9 or more real cases out there, so any (manually labeled, fixed) test set will necessarily be too small. The goal is to have a representative test set, covering everything we expect to see in the real world, and if that means a YUGE™ test set, then the best thing you can do is to have a test set larger than the training set.
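To convince yourself that nothing in the evaluation machinery depends on the test set being the smaller part, here is a toy sketch (the synthetic dataset, the model, and the deliberately inverted 20/80 split are all illustrative, not taken from the MONK's setup):

```python
# Toy sketch: evaluation works fine even when the test set is
# several times larger than the training set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Put 80% of the samples into the *test* set; a perfectly valid split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.8, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

If anything, the estimate of generalization error gets less noisy as the test set grows; it's the training set you usually want to be as large as possible.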
In this particular case (I'm not familiar with this specific task), the website you cited reads:
There are three MONK's problems. The domains for all MONK's problems are the same (described below). One of the MONK's problems has noise added. For each problem, the domain has been partitioned into a train and test set.
The paper linked below
Wnek, J. and Michalski, R.S., "Comparing Symbolic and Subsymbolic Learning: Three Studies," in Machine Learning: A Multistrategy Approach, Vol. 4., R.S. Michalski and G. Tecuci (Eds.), Morgan Kaufmann, San Mateo, CA, 1993.
on page 20 reads as follows:

So in this particular scenario, the authors chose different training conditions, hence the three training sets. According to
Leondes, Cornelius T., Image Processing and Pattern Recognition, Vol. 5, Elsevier, 1998, p. 307,
they used all 432 available samples for testing and trained on a subset of this data. (432 is every possible example in the domain: the six attributes take 3*3*2*3*4*2 = 432 distinct value combinations.)
Having an overlap between training and test data is considered bad practice, but who am I to judge research from 25 years ago in a field I'm not familiar with. Maybe it was too difficult to obtain more data and still have a clean split.
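If you want to verify the overlap yourself, a minimal sketch along these lines would do. It assumes the UCI file layout (one example per line: class label, six attribute values, and a trailing id column) and the file names from the UCI monks-problems distribution; adjust the paths to your copy:

```python
# Hedged sketch: count how many MONK's training examples reappear
# in the test set. Assumes the whitespace-separated UCI format:
# class, a1..a6, id.

def load_examples(path):
    examples = set()
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Keep the class label and the six attributes; drop the id,
            # which differs between the train and test files.
            examples.add(tuple(fields[:7]))
    return examples

train = load_examples("monks-1.train")
test = load_examples("monks-1.test")

overlap = train & test
print(f"{len(overlap)} of {len(train)} training examples "
      f"also appear in the {len(test)}-example test set")
```

Since the test set here is the entire 432-example domain, you should find that every training example shows up in the overlap.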