Make a column in test dataframe with the output of a regression model from the train set?

Question

I am fitting a logistic regression model with scikit-learn to predict a binary outcome (0 or 1).

X = tset.iloc[:,24:36].values
y = tset.iloc[:,20].values

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

I have fit the logistic regression above on my training data.

Now I would like to take what the model has learned and, for the test set, add a column with the probability of each row being 1. So, based on the variables in row 1, the new column would hold the probability that row 1 is 1 (in the binary classification), and so on for every other row in the test dataframe. Getting the predicted output (0 or 1) as a column would be helpful too.

I can't seem to find any tutorials on this step... How do I go about doing this?

Item ID     Variables     Predicted Output     Probability of Output
1           ...           1                    .62
2           ...           0                    .55
3           ...           0                    .52
4           ...           1                    .65

Would want it to look somewhat like that. ^

score 0 · Answer 1 · answered May 03 '20 at 01:45

Every sklearn classifier and regressor has a predict method that you can call. In your case:

 preds = logreg.predict(X_test)

If you want the probabilities of each class, call predict_proba, which returns an array of shape (n_samples, n_classes).
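As a minimal sketch of the two calls side by side: the data below is synthetic (an assumption, since the asker's `tset` is not shown), and `col_of_one` and `prob_of_one` are illustrative names. The columns of the `predict_proba` output follow the order of `logreg.classes_`, so looking up the column for class 1 is safer than hard-coding an index.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 12 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

preds = logreg.predict(X_test)        # hard 0/1 labels, shape (n_samples,)
proba = logreg.predict_proba(X_test)  # shape (n_samples, n_classes)

# Columns of proba are ordered like logreg.classes_, so find the column for class 1.
col_of_one = list(logreg.classes_).index(1)
prob_of_one = proba[:, col_of_one]
```

Here `prob_of_one` is a one-dimensional array with one probability per test row, which is the shape a single dataframe column expects.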

umbreon29

Thanks, the predict_proba is what I was looking for. How do I go about getting that into a column on the test dataframe though? I tried: preds = logreg.predict_proba(X_test) and then test['new column'] = preds ... And got an error. Sorry, Python noob. – fbnhost1 May 03 '20 at 01:51
The shape of your preds is `(len(X_test), 2)`: the first column is the probability of the first class and the second column is the probability of the second class, so try `test['firs_class_prob'] = preds[:,0]` – umbreon29 May 03 '20 at 02:04
When I try that I get the error "ValueError: Length of values does not match length of index". – fbnhost1 May 03 '20 at 02:06
len(X_test) is 2831 and test (the dataframe) is 1887. I'm not sure how the discrepancy there happened, as I split the df using train_test_split as shown in the OP. – fbnhost1 May 03 '20 at 02:07
But how was this test dataframe generated? – umbreon29 May 03 '20 at 02:11
Hmm...I'm not sure. When I was looking through my code I wasn't clear on the answer to that either, I just assumed that was a byproduct of the train_test_split function. – fbnhost1 May 03 '20 at 02:18
I just closed/reopened Jupyter and re-ran all my code up to that point. I think I must have generated the test DF at some point earlier in the evening while trying some other stuff out. Now that I have restarted Jupyter and re-run all the cleaned code, when I try your `test['firs_class_prob'] = preds[:,0]` recommendation, I am getting NameError: name 'test' is not defined. – fbnhost1 May 03 '20 at 02:23
@fbnhost1 probably because, as shown in your code above, it is `tset`, not `test`. – desertnaut May 03 '20 at 10:11
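
The length mismatch and the NameError discussed above both come from there being no test dataframe that lines up with `X_test`: `train_test_split` was given NumPy arrays, so it returned arrays. One possible way around this, sketched here under assumptions, is to split the dataframe itself and attach the predictions to the resulting test dataframe. The synthetic `tset`, the column names `f1` to `f4` and `outcome`, and the names `train_df`, `test_df`, `predicted_output` and `prob_of_1` are all hypothetical stand-ins, not the asker's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's `tset` dataframe.
rng = np.random.default_rng(0)
feature_cols = ["f1", "f2", "f3", "f4"]
tset = pd.DataFrame(rng.normal(size=(200, 4)), columns=feature_cols)
tset["outcome"] = (tset["f1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Split the dataframe itself, so the test rows remain a dataframe
# and keep their original index labels.
train_df, test_df = train_test_split(tset, test_size=0.25, random_state=0)
test_df = test_df.copy()  # work on a copy before adding columns

logreg = LogisticRegression()
logreg.fit(train_df[feature_cols].values, train_df["outcome"].values)

# Predictions come back in the same row order as X_test,
# so they can be assigned directly as new columns.
X_test = test_df[feature_cols].values
test_df["predicted_output"] = logreg.predict(X_test)
col_of_one = list(logreg.classes_).index(1)
test_df["prob_of_1"] = logreg.predict_proba(X_test)[:, col_of_one]
```

Because `test_df` and `X_test` have the same number of rows by construction, the "Length of values does not match length of index" error cannot occur with this layout.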
