
I have some questions.

  1. What is the purpose of randperm when splitting data into training and testing sets? It is used in this example, Multi-Class SVM (one versus all), but I still don't understand why randperm is needed.

  2. If my data is handwritten alphabet characters, can I use randperm as in the linked example?

Is there any resource or paper that can be used as background for this issue? I need some help, thank you.

  • If your data is position-dependent, e.g. if you have video from a moving vehicle and the terrain is changing, then you'll have to shuffle your data to get a representative split of testing and training data from your dataset. That said, this might be a better fit for http://stats.stackexchange.com/ – alrikai Jun 18 '13 at 20:34

1 Answer


I can only answer 1.

The point of a training set is to develop a generalization, which you then check against the test set. If you tweak anything about your learning algorithm and re-train/re-test without creating a new training and test set, you're really just learning the test set, not developing a generalization.

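randperm returns a random permutation of the sample indices, so each call gives you a different training/test split. If your results are stable across the shuffling of the training and test data, you are more likely to have learned a good generalization.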
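To make the shuffling concrete, here is a minimal sketch of a repeated random split. It is written in Python rather than MATLAB, so `randperm` below is only an analogue of MATLAB's function (and is 0-based); `holdout_split`, `repeated_holdout`, and the `evaluate` callback are hypothetical names introduced for illustration, not part of the linked example.

```python
import random


def randperm(n, rng):
    """Return a random permutation of 0..n-1 (0-based analogue of MATLAB's randperm)."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx


def holdout_split(n_samples, test_fraction, rng):
    """Shuffle the sample indices, then cut them into (train, test) index lists."""
    idx = randperm(n_samples, rng)
    n_test = int(round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]


def repeated_holdout(n_samples, test_fraction, n_repeats, evaluate, seed=0):
    """Draw a fresh random split for each repeat and collect one score per split.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains on
    the first index list and returns a score measured on the second.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, test_idx = holdout_split(n_samples, test_fraction, rng)
        scores.append(evaluate(train_idx, test_idx))
    return scores
```

If the scores returned by `repeated_holdout` vary a lot from one split to the next, that is a hint the result depends on which samples happened to land in the test set.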

This is called the repeated holdout method - see http://www.umiacs.umd.edu/~joseph/classes/459M/year2010/Chapter5-testing-4on1.pdf for a brief discussion of several methods. As alrikai suggested in the comments, this is the sort of material discussed on stats.stackexchange.com. For example: https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Tony Lee
  • What if the split produced by randperm is not well distributed? E.g., class A has 3 samples in the testing data, while class B has none in the testing data and appears only in the training data? – Rina Santi Jun 18 '13 at 22:11
  • You bring up a good point: you need training data and test data for each letter. With straight-up random selection, there is a chance you won't have enough in a given set, and a smarter random partitioning could mitigate this. A simpler option might be to verify the split after partitioning and repartition if it's not good (e.g., not enough examples of an 'a'). – Tony Lee Jun 18 '13 at 22:28
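Both ideas from the last comment can be sketched roughly as follows, again in Python rather than MATLAB. `stratified_split` shuffles within each class so that every letter appears on both sides of the split, and `covers_all_classes` is the after-the-fact check. Both function names, the "at least one test sample per class" rule, and the requirement of two samples per class are assumptions made for this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split indices into (train, test) so every class appears in both sets.

    Assumes each class has at least two samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label, idx in by_class.items():
        if len(idx) < 2:
            raise ValueError(f"class {label!r} needs at least 2 samples")
        rng.shuffle(idx)  # random order within this class only
        # Keep at least one sample of the class on each side of the split.
        n_test = min(max(1, int(round(len(idx) * test_fraction))), len(idx) - 1)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def covers_all_classes(labels, idx):
    """Return True if the samples selected by idx include every class in labels."""
    return {labels[i] for i in idx} == set(labels)
```

With a plain random split, `covers_all_classes` can be called on the training and test indices, and the split redrawn whenever it returns False.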