
I have a dataset with an ID column for each sample as in this example:

id score1 score2 score3
1  0.41   0.37   0.04
2  0.19   0.33   0.277
3  0.21   0.33   0.037
4  0.49   0.23   0.378
5  0.51   0.78   0.041

To fit an ML classifier on this data and make predictions, I have to remove the ID column from the data:

X = np.array(df.drop(['id'], axis=1))
X_train, X_test = model_selection.train_test_split(X, test_size=0.2)
clf.fit(X_train)
pred = clf.predict(X_test)

I am wondering how I can recover the IDs in the prediction results, so that I can tell for each sample whether it was classified correctly or not, since I already know the correct label of each sample. Or is there a way to keep the ID (which could be numeric or non-numeric) through training?

I found this related question, but I can't work out what to do from it, because it talks about other things like the Census Estimator, and I'm running a very simple Python script with just the numpy and scikit-learn libraries.

1 Answer


You can use the features of Pandas to do this. I used the iris dataset and the code below works fine; the label column holds the actual labels.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("ids.csv", sep=",")
clf = LogisticRegression()

X = df              # keep every column, including id, so it survives the split
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train_data = X_train.iloc[:, 1:5]   # feature columns only (drop id and label)
X_test_data = X_test.iloc[:, 1:5]
clf.fit(X_train_data, y_train)
pred = clf.predict(X_test_data)

sub = pd.DataFrame(data=X_test)       # the test rows, with their original ids
sub['pred'] = pred                    # attach the predictions
sub.head()                            # shows the first few rows

The result looks like this:

id   f1   f2   f3   f4   label  pred
144  6.8  3.2  5.9  2.3   2     2
68   5.8  2.7  4.1  1.0   1     1
10   4.9  3.1  1.5  0.1   0     0
137  6.3  3.4  5.6  2.4   2     2
46   4.8  3.0  1.4  0.3   0     0
JISHAD A.V
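
Since the asker already knows the true labels, one small extension of the answer above (a sketch that reuses the `sub` DataFrame built there, with the column names from its example output) is to flag each test row as correctly or incorrectly classified:

# Compare the known label with the prediction for every test row
# (column names follow the answer's example output)
sub['correct'] = sub['label'] == sub['pred']
print(sub[['id', 'label', 'pred', 'correct']].head())

Alternatively, staying closer to the numpy-based snippet in the question: scikit-learn's train_test_split accepts any number of equally long arrays and shuffles them together, so the id column can be carried through the split without ever being passed to the classifier. A minimal sketch, assuming the same ids.csv layout as in the answer (an id column, four feature columns, and a label column):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("ids.csv", sep=",")

ids = df['id'].to_numpy()                        # keep the ids on the side
X = df.drop(columns=['id', 'label']).to_numpy()  # features only
y = df['label'].to_numpy()                       # known labels

# All three arrays are shuffled and split together,
# so id_test lines up row-for-row with pred below.
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X, y, ids, test_size=0.2)

clf = LogisticRegression()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

results = pd.DataFrame({'id': id_test, 'label': y_test, 'pred': pred})
results['correct'] = results['label'] == results['pred']
print(results.head())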