Do I need to standardize the data using StandardScaler?
Generally speaking, this is called Feature Scaling, and there is more than one scaler for that purpose. In a nutshell:
StandardScaler
: usually your first option; it's very commonly used. It works by standardizing the data (i.e. centering and scaling it) so that each feature has Mean=0 and STD=1. It is affected by outliers, and should only be used if your data have a Gaussian-like distribution.
MinMaxScaler
: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers, simply because it uses the range (max - min).
RobustScaler
: It's "robust" against outliers because it scales the data according to the quantile range. However, you should know that outliers will still exist in the scaled data.
MaxAbsScaler
: mainly used for sparse data.
Unit Normalization
: basically, it scales the vector of each sample to have unit norm, independently of the distribution of the samples.
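To make the differences concrete, here is a minimal sketch (the toy array is made up for illustration; the 1000.0 is a deliberate outlier) that runs the same data through each scaler:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

# Toy data: the second column contains an obvious outlier (1000.0)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [5.0, 50.0],
              [6.0, 60.0],
              [7.0, 70.0],
              [8.0, 80.0],
              [9.0, 1000.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler(),
               MaxAbsScaler(), Normalizer()):
    print(type(scaler).__name__)
    print(scaler.fit_transform(X).round(2))
```

Comparing the outputs column by column shows the trade-offs described above, e.g. how the outlier squashes all the other values toward zero under MinMaxScaler but not under RobustScaler.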
Now, as a rule of thumb, we usually scale features for one (or more) of the following reasons:
- Some algorithms require features to be scaled, e.g. neural networks (to avoid, for example, vanishing gradients); another example is when we use the RBF kernel in SVM, etc.
- Feature scaling improves/speeds up convergence (see the pipeline sketch after this list).
- When features vary highly in magnitude, units and range (e.g. 5 kg and 5000 g), because we don't want the algorithm to falsely think that one feature is more important (i.e. has a higher impact on the model) than another.
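In practice, the standard way to apply scaling correctly is to fit the scaler on the training split only and reuse it on the test split, which a Pipeline does for you. A minimal sketch, assuming the Iris data as in your question:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training data only, then applies
# the exact same transformation to the test data before predicting.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```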
As you can see, feature scaling has nothing to do with the number of classes you have in Y.
..but for this iris dataset we will need to classify 0 or 1 or 2 based on specified... How does this work? I know LR works by predicting YES or NO, but here (iris) we have to predict 0 or 1 or 2
Well, in contrast to Binary Classification, this is called Multiclass Classification.
The basic idea here is that Scikit-learn's LogisticRegression uses the One-vs-Rest (OvR) scheme by default (a.k.a. One-vs-All) to solve it, which works (in the simplest words I can think of) like this:
Train a logistic regression classifier for each class i to predict the probability that y = i. Then, given a new input x, pick the class i that has the maximum likelihood (i.e. the highest hypothesis result). In other words, it reduces the multiclass classification problem to multiple binary classification problems; for more details, look here.
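You can see this scheme at work yourself; a minimal sketch using OneVsRestClassifier to make the OvR decomposition explicit (plain LogisticRegression handles this internally, so the wrapper here is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression classifier is trained per class
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# One probability per class; predict() picks the class with the highest one
print(clf.predict_proba(X[:1]).round(3))
print(clf.predict(X[:1]))
```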
If LogisticRegression also works for multiclass classification, then how can I optimize my above code for better prediction on other multiclass datasets I want to try?
Well, you don't have to do any optimization yourself; you're using the Scikit-learn library at a high level of abstraction, so it will take care of the optimization, and indeed it does that by using a solver. For a comparison between the solvers, look here (I wrote it once on Stack Overflow).
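If you still want to experiment, the main knob is the solver parameter (together with things like the regularization strength C); a minimal sketch comparing the available solvers with cross-validation on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Compare the solvers scikit-learn offers for LogisticRegression
for solver in ('lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'):
    clf = LogisticRegression(solver=solver, max_iter=5000)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f'{solver}: {scores.mean():.3f}')
```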
Do I need to convert my y_train, or do I need to do any type of encoding etc. for it to work?
For your case in particular (i.e. for the Iris dataset), the answer is no, because it's all set up and ready for you. But if the values in the dependent variable (i.e. Y) are not numerical, then you should convert them to numbers; for example, if you have 4 classes, you denote each class by a number (e.g. 0, 1, 2, 3). (Here is an example of replacing 0's and 1's by the words male and female; you'd do the opposite, but you get the idea from there :D)
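Scikit-learn also automates this conversion with LabelEncoder; a minimal sketch with made-up string labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels for the dependent variable
y = ['setosa', 'versicolor', 'virginica', 'setosa']

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded)                              # [0 1 2 0]
print(encoder.inverse_transform(y_encoded))   # back to the original strings
```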
A really good reference I'd recommend you to start with, one that will clear up all your doubts, is this course by Professor Andrew Ng.