Suppose I have a dataset in the following format:

col1    col2     col3      col4         col5 (to be predicted)
12      13       4         primary      12 
1       15       2         secondary    13
5       7        8         primary      18
14      12       44        college      6

col5 needs to be predicted for some test data using col1, col2, col3 and col4.

During training, col1, col2 and col3 can be fed to the classifier as they are, in an array, but how should col4 be fed? I am aware that it is categorical and needs to be converted to a numeric type, but even after assigning a number to each category, it remains a nominal variable.

So if primary=1, secondary=2 and college=3, the numbers 1, 2 and 3 cannot be compared by magnitude, because they are still just labels with no numerical significance.

So how should I proceed after this step? Should the values be normalized, or is some further processing needed?

mach

1 Answer

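You should use one-hot encoding in such cases. Every possible categorical value becomes a new binary feature. For col4, that means three binary columns (primary, secondary, college), with exactly one of them set to 1 in each row.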

One Hot Encoding for Machine learning
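
To make the idea concrete, here is a minimal dependency-free sketch of one-hot encoding applied to the rows from the question. The helper name `one_hot_encode` and its `cat_index` parameter are made up for illustration; in practice a library routine such as pandas `get_dummies` or scikit-learn's `OneHotEncoder` would normally do this step.

```python
# Feature rows from the question: (col1, col2, col3, col4).
# col5 is the target and is left out here.
rows = [
    (12, 13, 4, "primary"),
    (1, 15, 2, "secondary"),
    (5, 7, 8, "primary"),
    (14, 12, 44, "college"),
]


def one_hot_encode(rows, cat_index):
    """Replace the categorical column at cat_index with 0/1 indicator columns.

    Hypothetical helper for illustration only. Returns the sorted list of
    categories and the encoded rows (numeric columns first, then indicators).
    """
    # Collect the distinct category labels in a fixed (sorted) order.
    categories = sorted({row[cat_index] for row in rows})
    encoded = []
    for row in rows:
        # Keep the numeric columns unchanged.
        numeric = [v for i, v in enumerate(row) if i != cat_index]
        # One indicator per category: 1 for the row's label, 0 otherwise.
        indicators = [1 if row[cat_index] == c else 0 for c in categories]
        encoded.append(numeric + indicators)
    return categories, encoded


categories, features = one_hot_encode(rows, cat_index=3)
print(categories)
for feature_row in features:
    print(feature_row)
```

Because each category gets its own column, the classifier never sees an artificial ordering such as primary < secondary < college. The category list should be built from the training data and reused unchanged when encoding the test data, so that the columns line up.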

Ibraim Ganiev
  • @ Olologin but if a categorical feature has a large number of values, then I would have to add that many columns to the dataset. Won't that increase the complexity for the classifier? – mach Oct 03 '15 at 10:23
  • Yes, you should add all the binary features, and yes, it will increase the complexity of the classifier, but if you have enough data it should not be a problem. You can also compress the feature space, for example with feature hashing. – Ibraim Ganiev Oct 03 '15 at 10:25
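
As a rough illustration of the feature hashing mentioned in the last comment, the sketch below maps each category label to one of a fixed number of columns, so the width of the encoding no longer grows with the number of distinct values. The function name `hash_category` and the bucket count of 8 are assumptions chosen for the example; this is a simplified version of the idea, not the implementation used by any particular library (scikit-learn ships a `FeatureHasher` class for real use).

```python
import hashlib


def hash_category(value, n_buckets=8):
    """Map a category label to a fixed-length 0/1 vector via hashing.

    Hypothetical helper for illustration only. Different labels can collide
    in the same bucket; that is the price paid for the fixed width.
    """
    # Use a stable hash: Python's built-in hash() of a str varies per process.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_buckets
    vector = [0] * n_buckets
    vector[index] = 1
    return vector


for label in ["primary", "secondary", "college"]:
    print(label, hash_category(label))
```

A smaller `n_buckets` gives a more compact feature space but more collisions between unrelated categories, so the value is a trade-off to tune rather than a fixed recommendation.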