Based on the documentation:
test_size : float, int or None, optional (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
This gives you the split between your train data and test data, if you have in total 1000 data points, a test_size=0.25
would mean that you have:
- 750 data points for train
- 250 data points for test
The perfect size is still under discussions, for large datasets (1.000.000+ ) I currently prefer to set it to 0.1. And even before I have another validation dataset, which I will keep completly out until I decided to run the algorithm.
random_state : int, RandomState instance or None, optional
(default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
For machine learning you should set this to a value, if you set it, you will have the chance to open your programm on another day and still produce the same results, normally random_state is also in all classifiers/regression models avaiable, so that you can start working and tuning, and have it reproducible,
To comment your regression:
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
- Will load your Regression, for python this is only to name it
- Will fit your logistic regression based on your training set, in this example it will use 750 datsets to train the regression. Training means, that the weights of logistic regression will be minimized with the 750 entries, that the estimat for your
y_train
fits
- This will use the learned weights of step 2 to do an estimation for
y_pred
with the X_test
After that you can test your results, you now have a y_pred
which you calculated and the real y_test
, you can know calculate some accuracy scores and the how good the regression was trained.