I'm working on an NLP project where I plan to use MaxEnt to categorize text into one of 20 classes. I'm creating the training, validation, and test sets by hand from handwritten administrative data.
I would like to determine the sample size required per class in the training set and the appropriate size of the validation and test sets.
In the real world, the 20 classes are imbalanced, but I'm considering creating a balanced training set to help build the model.
So I have two questions:
How should I determine the appropriate sample size for each category in the training set?
Should the validation/test sets be imbalanced, to reflect the class distribution the model will encounter on real-world data?