
I know I'm packing a lot of questions into one, but these are the doubts I ran into while using Logistic Regression on the Iris dataset.

This is my code for using LogisticRegression on the iris dataset.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=81, test_size=0.3)
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
pred = logreg.predict(x_test)
accuracy_score(y_test, pred)  # this gives the accuracy
# 0.95555

I know that LogisticRegression works by predicting either 1 or 0, but for this iris dataset it needs to classify 0, 1, or 2.

Q) Do I need to standardize the data using StandardScaler?

Q) How does this work? I know LR works by predicting YES or NO, but here (iris) we have to predict 0, 1, or 2.

Q) If LogisticRegression also works for multiclass classification, how can I optimize my code above for better predictions on other multiclass datasets I want to try?

Q) Do I need to convert my y_train, or do any type of encoding, for it to work?

I would really appreciate it if anyone could help me figure these out. Any good references are also appreciated.

Jeeth

2 Answers


Do I need to standardize the data using StandardScaler?

Generally speaking, this is called Feature Scaling, and there is more than one scaler for that purpose. In a nutshell:

  1. StandardScaler: usually your first option; it's very commonly used. It works by standardizing the data (i.e. centering it), that is, bringing it to STD=1 and Mean=0. It is affected by outliers, and should only be used if your data have a Gaussian-like distribution.
  2. MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]). It is heavily affected by outliers, simply because it uses the range.
  3. RobustScaler: it's "robust" against outliers because it scales the data according to the quantile range. However, you should know that outliers will still exist in the scaled data.
  4. MaxAbsScaler: mainly used for sparse data.
  5. Unit Normalization: it scales the vector for each sample to have unit norm, independently of the distribution of the samples.

Now, as a rule of thumb, we usually scale features for one (or more) of the following reasons:

  1. Some algorithms require features to be scaled, e.g. Neural Networks (to avoid, for example, vanishing gradients); another example is when we use an RBF kernel in an SVM, etc.
  2. Feature scaling improves/speeds up convergence.
  3. Features vary highly in magnitude, units and range (e.g. 5 kg and 5000 g), and we don't want the algorithm to falsely think that one feature is more important (i.e. has a higher impact on the model) than another.

As you can see, feature scaling has nothing to do with the number of classes you have in Y.
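If you do decide to scale, a minimal sketch of adding StandardScaler to your code via a Pipeline (reusing the x_train, x_test, y_train, y_test variables from your question) could look like this:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# the scaler is fit on the training data only, then applied to the test data
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(x_train, y_train)
print(scaled_logreg.score(x_test, y_test))  # accuracy on the scaled test data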


...but for this iris dataset we need to classify 0 or 1 or 2... How does this work? I know LR works by predicting YES or NO, but here (iris) we have to predict 0, 1, or 2

Well, in contrast to Binary Classification, this is called Multiclass Classification.

The basic idea here is that Scikit-learn's LogisticRegression uses the One-vs-Rest (OvR) scheme by default (a.k.a. One-vs-All) to solve it, which works (in the simplest words I can think of) like this:

Train a logistic regression classifier for each class i to predict the probability that y = i. On a new input x, to make a prediction, pick the class i that has the maximum likelihood (i.e. the highest predicted probability). In other words, it reduces the multiclass classification problem to multiple binary classification problems; for more details look here.
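To make the OvR mechanics concrete, here is a rough sketch using scikit-learn's OneVsRestClassifier wrapper around LogisticRegression (the x_train, y_train, x_test variables are assumed from your question; LogisticRegression already handles this internally, the wrapper is only to show the idea explicitly, and the default scheme may differ in newer scikit-learn versions):

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

ovr = OneVsRestClassifier(LogisticRegression())  # one binary classifier per class
ovr.fit(x_train, y_train)

proba = ovr.predict_proba(x_test)  # shape (n_samples, 3): one column per iris class
pred = np.argmax(proba, axis=1)    # pick the class with the highest probability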


If LogisticRegression also works for multiclass classification, how can I optimize my code above for better predictions on other multiclass datasets I want to try?

Well, you don't have to do any optimization yourself; you're using the Scikit-learn library at a high level of abstraction, so it will take care of the optimization, and indeed it does that by using a solver. For a comparison between solvers, look here (I wrote it once on Stack Overflow).
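That said, if you do want to tune things yourself, a common optional step is a small grid search over the regularization strength C and the solver, using the x_train, y_train split from your question. A sketch (the grid values below are only illustrative, not recommendations):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear', 'newton-cg'],
}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)  # best hyperparameters and CV accuracy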


Do I need to convert my y_train, or do any type of encoding, for it to work?

For your case in particular (i.e. the Iris dataset), the answer is No, because it's already set up for you. But if the values in the dependent variable (i.e. Y) are not numerical, then you should convert them to numbers; for example, if you have 4 classes, you denote each class by a number (e.g. 0, 1, 2, 3). (Here is an example of replacing 0's and 1's with the words male and female; you would do the opposite, but you get the idea from there :D)
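For illustration, a tiny sketch with LabelEncoder for the case where Y holds strings (the string labels below are just an example):

from sklearn.preprocessing import LabelEncoder

y_text = ['setosa', 'versicolor', 'virginica', 'setosa']
le = LabelEncoder()
y_encoded = le.fit_transform(y_text)  # e.g. array([0, 1, 2, 0])
print(le.classes_)                    # index -> original label mapping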


A really good reference I'd recommend you start with, and it will clear up all your doubts, is this course by Professor Andrew Ng.

Yahya
  • I just have one doubt. There's a parameter in `LogisticRegression()` called `multi_class`. Can I use it? What is it for? I was thinking it could be related to multiclass classification. – Jeeth Oct 01 '18 at 11:17
  • 2
    @user2475 This `multi_class` parameter is nothing but choosing between `OvR` scheme or `multinomial` scheme. Default is `ovr` which I explained in my answer, you really don't need to worry about the second option but in case you're curios, look [here](http://www.statisticssolutions.com/mlr/) and [here](https://en.wikipedia.org/wiki/Multinomial_logistic_regression). – Yahya Oct 01 '18 at 11:21

"Do I need to Standardize the data using StandardScalar?"

The purpose of normalizing the dataset is to make the model converge faster. As far as this problem is concerned, it is relatively simple, so standardizing isn't necessary, though you can do it regardless.

"How this works? I know LR works by prediction YES OR NO but here(iris) we have to predict 0 or 1 or 2"

Basically, for multiclass classification, multiple models are created (3 in this case). Each model predicts YES or NO for its own class, so every test sample is scored against all of the classes, and the class with the highest probability of YES is the one returned to you.
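You can see this "highest probability wins" behaviour directly from the model in your question (logreg and x_test are assumed from that code; the printed values in the comments are illustrative, not exact outputs):

import numpy as np

probs = logreg.predict_proba(x_test)  # one probability per class, each row sums to 1
print(probs[0])                       # e.g. [0.02, 0.91, 0.07] for the three iris classes
print(np.argmax(probs[0]))            # predict() returns the class with the largest probability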

"Do I need to convert my y_train or do I need to do any type of encoding etc for it to work?"

No, you may pass the y_train data as it is.

"What are all the scoring parameters to use when we use multiple classification with LogisticRegression. How are these scoring parameters differs from single class classification (o or 1)"

I didn't really get this question, but you are supposed to create the logistic regression model like this: `logreg = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')`
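For multiclass scoring specifically, a short sketch of the usual options, using the y_test and pred variables from your question (these metric choices are suggestions, not an exhaustive list):

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, pred))  # per-class precision, recall and F1, plus macro/weighted averages
print(confusion_matrix(y_test, pred))       # which classes get confused with which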

An example of your exact application can be found here: http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html

Imtinan Azhar