I'm working on an NLP project where I plan to use MaxEnt to categorize text into one of 20 classes. I'm creating the training, validation, and test sets by hand from handwritten administrative data.
I would like to determine the sample size required per class in the training set and the appropriate size of the validation and test sets.
In the real world, the 20 classes are imbalanced, but I'm considering creating a balanced training set to help build the model.
So I have two questions:
How should I determine the appropriate sample size for each category in the training set?
Should the validation/test sets be imbalanced, to reflect the class distribution the model will encounter on real-world data?