
I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.

I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.

There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
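To make that concrete, here is a tiny sketch of the kind of coincidence I mean (the sequences and the names train_features / test_features are made up purely for illustration):

    # Made-up sequences: count how many test feature vectors are exactly
    # identical to some training feature vector.
    train_features = ["MKVLA", "MKVLT", "MQVLA", "MKVLS"]   # stand-ins for protein sequences
    test_features  = ["MKVLA", "MRVLA", "MKVLT"]

    train_set = set(train_features)
    overlap = [f for f in test_features if f in train_set]
    print(f"{len(overlap)} of {len(test_features)} test samples duplicate a training vector")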

My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.

However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.

What do you think?

Mike

4 Answers


It would be nice to know whether you're talking about a couple of repetitions in a million samples or 10 repetitions in 15 samples.

In general I don't find what you're doing reasonable; I think your advisor has a very good point. Your evaluation needs to be as close as possible to using your classifier outside your control -- you can't just assume you're going to be evaluated on a data point you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.

My experience is in computer vision, where it would be highly questionable to train and test with the same picture of the same subject. In fact, I wouldn't be comfortable training and testing with different frames of the same video, let alone the same frame.

EDIT:

There are two separate issues here:

  1. Whether the distribution permits these repetitions to happen naturally. I believe you: you know your experiment and your data, you're the expert.

  2. Whether you're getting a boost from these repetitions, and whether that boost is unfair. One way to address your advisor's concern is to measure how much leverage the repeated data points actually give you (see the sketch below). Generate 20 test cases: 10 in which you train on 1,000 samples and test on 33 with no repetitions allowed in the 33, and another 10 in which you train on 1,000 and test on 33 with repetitions allowed as they occur naturally. Report the mean and standard deviation of both experiments.
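A rough sketch of what I have in mind, assuming your data fit in a feature matrix X and a label vector y (the synthetic data, the classifier, and the 1,000/33 split sizes below are just stand-ins for yours):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Synthetic stand-in for your ~1,033 samples; replace with your own X, y.
    X, y = make_classification(n_samples=1033, n_features=20, random_state=0)
    rng = np.random.default_rng(0)

    def run_split(allow_repeats, n_train=1000, n_test=33):
        """One random split: train on n_train samples, test on up to n_test.
        When allow_repeats is False, drop candidate test samples whose feature
        vector also appears (exactly) among the training samples."""
        idx = rng.permutation(len(X))
        train, pool = idx[:n_train], idx[n_train:]
        if not allow_repeats:
            train_rows = {tuple(row) for row in X[train]}
            pool = np.array([i for i in pool if tuple(X[i]) not in train_rows])
        test = pool[:n_test]
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        return accuracy_score(y[test], clf.predict(X[test]))

    scores_with    = [run_split(allow_repeats=True)  for _ in range(10)]
    scores_without = [run_split(allow_repeats=False) for _ in range(10)]
    print("with repetitions:    mean %.3f, std %.3f" % (np.mean(scores_with), np.std(scores_with)))
    print("without repetitions: mean %.3f, std %.3f" % (np.mean(scores_without), np.std(scores_without)))

If the two means are close relative to their standard deviations, the repetitions aren't buying you much.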

carlosdc
  • I don't think I'm assuming that I'm going to be evaluated on a data point I've already seen. I have a set of about 1,000 training samples that were taken at various dates around the world, and a set of 33 testing samples. The testing samples were collected using approximately the same process as the training samples, and essentially the same process as any sample that will be presented to the classifier in the future. Let's say we have a distribution that we're sampling from to get both datasets. If it just so happens that we get the same feature vector by chance, to me that still reflects reality. – Mike Dec 05 '13 at 06:34

It depends... Your adviser is suggesting the common practice: you usually test a classifier on samples that have not been used for training. If only a few test samples match training samples, their reappearance will not make a statistically significant difference to your results. If you want to be formal and still keep your logic, you would have to show that the reappearance of the same vectors has no statistically significant effect on the testing process. If you showed that, I would accept your logic. See this ebook on statistics in general, and this chapter as a starting point on statistical significance and null hypothesis testing.
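One possible way to operationalize that check (a sketch under made-up numbers, not a prescribed recipe): tabulate correct/incorrect counts separately for the test samples that duplicate a training vector and for the rest, then test whether the two accuracies differ, e.g. with Fisher's exact test given such small counts.

    import numpy as np
    from scipy.stats import fisher_exact

    # Made-up per-sample results for a 33-sample test set: 1 = classified
    # correctly, plus a mask marking which test vectors also occur in training.
    correct = np.array([1] * 28 + [0] * 5)
    is_dup = np.zeros(33, dtype=bool)
    is_dup[[0, 7, 19]] = True          # suppose 3 of the 33 duplicate a training vector

    table = [
        [int(correct[is_dup].sum()),  int((1 - correct[is_dup]).sum())],    # duplicates: correct / wrong
        [int(correct[~is_dup].sum()), int((1 - correct[~is_dup]).sum())],   # the rest:   correct / wrong
    ]

    # With counts this small a non-significant result proves little, so treat
    # this as a sanity check rather than a formal proof.
    odds_ratio, p = fisher_exact(table)
    print("accuracy incl. duplicates: %.3f" % correct.mean())
    print("accuracy excl. duplicates: %.3f" % correct[~is_dup].mean())
    print("Fisher exact p-value:      %.3f" % p)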

Hope I helped!

Pantelis Natsiavas

Inasmuch as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave in repetitions. The test data should be representative of the kind of data you would expect your method to perform on. If you can genuinely get exact replicates, that's fine. However, I would question what kind of domain makes it possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values per feature, such that different points in input space map to the same point in feature space?

The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
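To put a very rough number on that (a back-of-the-envelope normal approximation, with an assumed true accuracy of 0.9):

    import math

    # Standard error of an accuracy estimate from n i.i.d. test samples;
    # n and p are assumed values, not the asker's actual numbers.
    n, p = 33, 0.90
    se = math.sqrt(p * (1 - p) / n)                                      # ~0.052
    print("std. error ~ %.3f, 95%% CI ~ +/- %.3f" % (se, 1.96 * se))     # roughly +/- 10 points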

Ben Allison

Having several duplicate or very similar samples seems somewhat analogous to a non-uniform distribution in the population you're attempting to classify: certain feature combinations are more common than others, and their high occurrence in your data gives them more weight. Either that, or your samples are not representative.

Note: Of course, even if a population is uniformly distributed, there is always some likelihood of drawing similar samples (perhaps even identical ones, depending on the distribution).

You could probably argue that identical observations are a special case, but are they really? If your samples are representative, it seems perfectly reasonable that some feature combinations would be more common than others (and perhaps even identical, depending on your problem domain).

mattnedrich