
I am trying to apply the LogisticRegression model from sklearn to the MNIST dataset, and I have split the data into a 70-30 train-test split.

However, when I simply call model.fit(train_x, train_y), it takes a very long time.

I have added no parameters when initializing LogisticRegression.

Code:

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import tempfile

# Download MNIST: 70,000 flattened 28x28 images (784 features each)
test_data_home = tempfile.mkdtemp()
mnist = fetch_mldata('MNIST original', data_home=test_data_home)

# 70-30 train-test split
x_train, x_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.30, random_state=0)

# No explicit solver; penalty='l2' is already the default
lr = LogisticRegression(penalty='l2')
lr.fit(x_train, y_train)
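
(Editorial note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22 because mldata.org went offline. On newer versions, a roughly equivalent load, sketched here with fetch_openml, would be:)

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)  # same 70,000 x 784 data
x, y = mnist.data, mnist.target.astype(int)   # OpenML labels come back as strings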
  • Can you share the code you are running? And also the specs of your machine? – alift Apr 26 '19 at 01:12
  • @alift added the image for specs and code – TheNoob Apr 26 '19 at 01:14
  • It would be helpful to see the code, and also to know approximately how long it takes to get results: 10 mins? 1 hour? etc. – alift Apr 26 '19 at 01:16
  • @alift It took about 20 minutes to fit the model to the data. Those 2 lines are literally all I have done so far. – TheNoob Apr 26 '19 at 01:17
  • Well, your specs seem good to me, but 20 minutes is too much. I am afraid I cannot help you because you did not share the full code; I cannot find the issue with just the two lines you have shared. If you do not feel comfortable sharing the code, have a look here and see what you are doing differently; hopefully that helps: https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html . Good luck. – alift Apr 26 '19 at 01:26
  • @alift My apologies, I just assumed these 2 lines of code were the ones you were looking for. I just want to understand why it takes so long. Also, I think I have followed a similar process as given in the link. Thank you for all your help. – TheNoob Apr 26 '19 at 01:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/192403/discussion-between-alift-and-thenoob). – alift Apr 26 '19 at 01:48

2 Answers


The issue you have brought up is fairly vague, but I am fairly sure your logistic regression is not converging. I am not particularly sure why you are explicitly passing an L2 penalty term (it is the default anyway) unless you are worried about overfitting. Anyhow, if you look at the sklearn docs for the solver parameter, they say:

Algorithm to use in the optimization problem.

For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones. For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes. ‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty, whereas ‘liblinear’ and ‘saga’ handle L1 penalty. Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.

I would immediately suggest adding the parameter `solver='sag'` (or `'saga'`), because the docs clearly say those solvers are faster on large datasets like MNIST, while liblinear, the old default, fits one one-versus-rest model per class and is only a good choice for small datasets. There is a really great post on the solvers for logistic regression that you can look at for your dataset:
Solvers for Logistic Regression
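
For example, a minimal sketch of that change (only the solver is swapped; scaling the 0-255 pixel values is added because, per the quoted docs, sag/saga converge fast only when features are on approximately the same scale):

from sklearn.linear_model import LogisticRegression

# x_train, y_train come from the train_test_split in the question
lr = LogisticRegression(penalty='l2', solver='sag')
lr.fit(x_train / 255.0, y_train)  # scale pixels to [0, 1] so 'sag' converges quickly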

Keep in mind that L2 and L1 regularization are there to deal with overfitting, and as such you can also tune the C parameter (the inverse of the regularization strength) in your lr definition; a sketch follows below. Please look at the sklearn docs for further information. Hope this helps.
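
A rough sketch of that tuning (C=0.1 is illustrative, not a tuned value; the scaler is there so the solver converges quickly):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Smaller C means stronger L2 regularization; scaling first keeps 'saga' fast
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(C=0.1, penalty='l2', solver='saga'))
clf.fit(x_train, y_train)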


First of all, MNIST is not a binary classification problem but a multiclass one. So, according to the documentation in scikit-learn:

multi_class : str, {‘ovr’, ‘multinomial’, ‘auto’}, default: ‘ovr’ If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

You need to specify this when creating your model, for instance:
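
A minimal sketch (lbfgs is used because, per the quote above, liblinear cannot fit a multinomial model):

from sklearn.linear_model import LogisticRegression

# One joint softmax model over all 10 digits instead of 10 one-vs-rest fits
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lr.fit(x_train, y_train)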

Also, since MNIST's features all have approximately the same magnitude, I believe you should explicitly set your solver to saga, which converges faster than the other solvers on data like this.

So, following the scikit-learn example here, I would set the training parameters and change your code to:

train_samples = x_train.shape[0]  # the example scales C by the training set size

lr = LogisticRegression(C=50. / train_samples,
                        multi_class='multinomial',
                        penalty='l1', solver='saga', tol=0.1)
lr.fit(x_train, y_train)
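
Once fitted, you can sanity-check the accuracy on your held-out 30% split (lr.score returns mean accuracy):

# Mean accuracy on the held-out test data from the question's split
print('Test accuracy: %.4f' % lr.score(x_test, y_test))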