Logistic Regression - Python?

Question

Could you briefly explain what the lines of code below mean? This is logistic regression code in Python.

What do test_size=0.25 and random_state=0 mean? What is train_test_split, and what does this line of code do?

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

And what do these lines of code do?

logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)

score 2 · Answer 1 · answered Nov 27 '19 at 07:58

Have a look at the description of the function here:

random_state sets the seed for the random number generator so that you get the same result on every run. This is especially useful in educational settings, where everyone should get an identical result.
test_size is the proportion of the data used for the test split. Here, 75% of the data is used for training and 25% for testing the model.

The other lines fit the logistic regression on the training dataset and predict labels for the test dataset. You then use the test dataset to check how well the fitted model performs.
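
As a minimal end-to-end sketch of those steps, assuming scikit-learn is installed: the synthetic dataset from make_classification is only a stand-in for the asker's X and y, and its size and feature count are assumptions, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out 25% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)     # learn from the training rows
y_pred = logistic_regression.predict(X_test)  # predict the held-out rows

# Fraction of held-out rows that were predicted correctly.
test_accuracy = logistic_regression.score(X_test, y_test)
print(X_train.shape, X_test.shape, test_accuracy)
```

With 200 rows, the split leaves 150 rows for training and 50 for testing. The accuracy value depends on the data, so no particular number should be expected here.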

Uwe Ziegenhagen

So what do I get when I set random_state = 0? Why 0 and not, for example, 45, or some other number? – dingaro Nov 27 '19 at 08:08
It is just the seed, a kind of starting value for the random number generator. Usually it does not need to be set; it is only useful when you want to replicate the _exact_ same result. – Uwe Ziegenhagen Nov 27 '19 at 10:15
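
The point about the seed can be checked with a small sketch, assuming scikit-learn is installed. The toy list of 100 integers is an assumption chosen only to make the shuffling visible.

```python
from sklearn.model_selection import train_test_split

data = list(range(100))  # toy data (assumption)

# Same seed twice: the two splits are identical.
train_a, test_a = train_test_split(data, test_size=0.25, random_state=0)
train_b, test_b = train_test_split(data, test_size=0.25, random_state=0)

# A different seed is just as repeatable, but starts from a different shuffle.
train_c, test_c = train_test_split(data, test_size=0.25, random_state=45)
train_d, test_d = train_test_split(data, test_size=0.25, random_state=45)

print(test_a == test_b)  # True: seed 0 always gives the same split
print(test_c == test_d)  # True: seed 45 always gives the same split
print(test_a == test_c)  # compares the seed-0 split with the seed-45 split
```

Neither 0 nor 45 is better than the other. Two different seeds will in practice produce different splits, and what matters is only that you reuse the same seed when you want the same split again.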

score 2 · Answer 2 · answered Nov 27 '19 at 07:58

What do test_size=0.25 and random_state=0 mean?

test_size=0.25 -> 25% of the data is held out as the test set; the remaining 75% is used for training.

random_state=0 -> fixes the seed so that results are reproducible; this can be any number.

What does this line of code do?

It splits X and y into training parts (X_train, y_train) and test parts (X_test, y_test).

And what do these lines of code do?

They train the logistic regression model with fit(X_train, y_train) and then make predictions on the test set X_test.

Afterwards you would typically compare y_pred with y_test to see how accurate the model is.
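
A minimal sketch of that comparison in plain Python. The label values below are made up for illustration (an assumption); in practice y_test comes from the split and y_pred from predict().

```python
# Toy labels (assumption): stand-ins for the real y_test and y_pred.
y_test = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Accuracy is the share of test samples whose prediction matches the true label.
correct = sum(1 for true, pred in zip(y_test, y_pred) if true == pred)
accuracy = correct / len(y_test)
print(accuracy)  # 0.75 (6 of the 8 predictions match)
```

scikit-learn's accuracy_score(y_test, y_pred) in sklearn.metrics computes the same quantity.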

NeverHopeless · Answer 3 · 2019-11-27T09:40:39.163

This line:

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

divides your data into a training set and a test set. The 0.25 means that 25% of the data will be used for testing and the remainder for training.

For random_state = 0, here is a brief discussion. An excerpt from it:

If you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2.

logistic_regression= LogisticRegression() #Creates logistic regressor

Learns the model parameters from your training data. Recommended read

logistic_regression.fit(X_train,y_train)

An excerpt from the link above:

Here the fit method, when applied to the training dataset, learns the model parameters (for example, mean and standard deviation) .... It doesn't matter whether the actual random_state number is 42, 0, 21, ... The important thing is that every time you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results.

Performs prediction on the test set, based on what was learned from the training set.

y_pred=logistic_regression.predict(X_test)

I read your link. However, could you tell me what the difference is if I use, for example, random_state = 0 versus random_state = 45? – dingaro, Nov 27 '19 at 08:23
I updated the answer based on your comment. It doesn't matter which number you select, but whichever you select, keep using it if you want to reproduce the result, no matter whether it is 0, 42, 45 or anything else. – NeverHopeless, Nov 27 '19 at 09:42

score 1 · Answer 4 · answered Nov 27 '19 at 09:06

Based on the documentation:

test_size : float, int or None, optional (default=None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

This gives you the split between your training data and your test data. If you have 1000 data points in total, test_size=0.25 means that you have:

750 data points for train
250 data points for test
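
Those numbers can be reproduced with a short sketch, assuming scikit-learn and NumPy are installed. The placeholder arrays are assumptions that only serve to provide 1000 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # 1000 placeholder rows (assumption)
y = np.arange(1000) % 2             # placeholder binary labels (assumption)

# Float test_size: a proportion of the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))  # 750 250

# Integer test_size: an absolute number of test samples.
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(
    X, y, test_size=100, random_state=0
)
print(len(X_train_n), len(X_test_n))  # 900 100
```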

The ideal size is still debated. For large datasets (1,000,000+ data points) I currently prefer to set it to 0.1. Even before this split, I set aside a separate validation dataset, which I keep completely out of the process until I decide to run the algorithm.
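
One way to set aside such a separate dataset is to call train_test_split twice. This is a sketch under assumptions: the hold-out size of 100 rows and the placeholder arrays are arbitrary choices for illustration, not values from this answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # 1000 placeholder rows (assumption)
y = np.arange(1000) % 2             # placeholder binary labels (assumption)

# Step 1: put 100 rows aside; they are not touched while training and tuning.
X_work, X_holdout, y_work, y_holdout = train_test_split(
    X, y, test_size=100, random_state=0
)

# Step 2: split the remaining 900 rows into train and test as usual.
X_train, X_test, y_train, y_test = train_test_split(
    X_work, y_work, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test), len(X_holdout))  # 675 225 100
```

The three parts do not overlap, so the hold-out rows give an evaluation that tuning decisions have not influenced.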

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

For machine learning you should set this to a value. If you set it, you can open your program on another day and still produce the same results. random_state is normally also available in the classifiers and regression models, so you can work and tune while keeping everything reproducible.

To comment on your regression code:

logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)

1. Creates your logistic regression object; in Python this just instantiates the model and gives it a name.
2. Fits your logistic regression on your training set; in this example it uses 750 data points to train the regression. Training means that the weights of the logistic regression are optimized on those 750 entries so that the model's estimates fit y_train as closely as possible.
3. Uses the weights learned in step 2 to compute the estimate y_pred from X_test.

After that you can test your results. You now have the y_pred that you calculated and the real y_test, so you can calculate accuracy scores and see how well the regression was trained.
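
To make the "learned weights" concrete, here is a sketch assuming scikit-learn and NumPy are installed and a binary target. The synthetic data from make_classification is an assumption used only as a stand-in for real X and y.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): binary target with labels 0 and 1.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)

# After fit, the learned parameters are stored on the estimator.
weights = logistic_regression.coef_     # shape (1, n_features) for a binary target
bias = logistic_regression.intercept_   # shape (1,)

# predict() applies them: a positive linear score means class 1, otherwise class 0.
scores = (X_test @ weights.T + bias).ravel()
manual_pred = (scores > 0).astype(int)

y_pred = logistic_regression.predict(X_test)
print(np.array_equal(manual_pred, y_pred))  # True
```

So fit is the step that computes the weights, and predict only applies them to new rows.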

score 0 · Answer 5 · answered Nov 27 '19 at 07:58

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

The line above randomly splits your data into training and testing data.

X is your dataset without the output variable (the features)
y is your output variable
test_size=0.25 means you are dividing the data 75%-25%, where the 25% is your testing dataset
random_state is used to generate the same split again each time you run the code

Refer to the train_test_split documentation.
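
A small sketch of how X and y relate to one table, assuming scikit-learn and NumPy are installed. The eight rows are invented for illustration (an assumption): the last column is the output variable and equals 1 when the first feature is above 5.5.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical table: two feature columns followed by the output column.
data = np.array([
    [4.9, 3.0, 0],
    [5.0, 3.4, 0],
    [5.1, 3.5, 0],
    [5.4, 3.9, 0],
    [6.0, 2.7, 1],
    [6.3, 3.3, 1],
    [6.7, 3.1, 1],
    [7.0, 3.2, 1],
])

X = data[:, :-1]  # the dataset minus the output variable
y = data[:, -1]   # the output variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```

X and y are shuffled together, so every row of X_train still belongs to the matching entry of y_train.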
