I am kind of new to machine learning, and I am working on a classification/regression problem.
In the dataset, there is a weather feature that takes a few categorical values, such as Sunny, Rainy, Windy, and Cloudy.
There are two possible ways to transform this feature:
1. Give each category a numeric index, for example:
date        weather  indexedWeather
2017-11-01  Sunny    0
2017-11-02  Cloudy   1
2017-11-03  Snow     3
2017-11-04  Cloudy   1
2017-11-05  Windy    2
2017-11-06  Sunny    0
2017-11-07  Snow     3
2017-11-08  Cloudy   1
Spark MLlib has a VectorIndexer transformer to do this task.
2. Transform this feature into a binary vector:
date        weather  indexedWeather
2017-11-01  Sunny    1 0 0 0
2017-11-02  Cloudy   0 1 0 0
2017-11-03  Snow     0 0 1 0
2017-11-04  Cloudy   0 1 0 0
2017-11-05  Windy    0 0 0 1
2017-11-06  Sunny    1 0 0 0
2017-11-07  Snow     0 0 1 0
2017-11-08  Cloudy   0 1 0 0
As far as I can tell, Spark MLlib doesn't provide a transformer for this kind of task.
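To make the two options concrete, here is a minimal plain-Python sketch of both encodings. It is not Spark code. The function names `index_encode` and `one_hot_encode` are made up for illustration, and I am assuming indices are assigned in first-seen order, so the index values can differ from the ones in the first table above.

```python
def index_encode(values):
    """Option 1: map each distinct category to an integer index.

    Assumption: indices are assigned in the order categories first appear.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Option 2: map each category to a binary vector with a single 1."""
    indices, mapping = index_encode(values)
    size = len(mapping)
    return [[1 if position == index else 0 for position in range(size)]
            for index in indices]


# The weather column from the tables above.
weather = ["Sunny", "Cloudy", "Snow", "Cloudy",
           "Windy", "Sunny", "Snow", "Cloudy"]

indexed, mapping = index_encode(weather)
one_hot = one_hot_encode(weather)

for value, index, vector in zip(weather, indexed, one_hot):
    print(value, index, vector)
```

With option 1 the model sees a single numeric column, so the categories get an implied order (for example, Snow ends up "greater than" Sunny). With option 2 each category gets its own column and no order is implied, at the cost of one column per category.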
Which one is preferred? Both options seem to be used in practice. I lean toward the second option, but I would like to hear your thoughts.