I have a dataset that has a unique identifier and other features. It looks like this:

ID      LenA TypeA LenB TypeB Diff Score Response
123-456  51   M     101  L     50   0.2   0
234-567  46   S     49   S     3    0.9   1
345-678  87   M     70   M     17   0.7   0

I split it into training and test data. I am trying to classify the test data into two classes using a classifier trained on the training data. I want to keep the identifier in both the training and test datasets so I can map the predictions back to the IDs.
Is there a way to mark the identifier column as an ID or non-predictor, as we can in Azure ML Studio or SAS?

I am using the DecisionTreeClassifier from Scikit-Learn. This is the code I have for the classifier.

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(traindata, trainlabels)

If I just include the ID into the traindata, the code throws an error:

ValueError: invalid literal for float(): 123-456

Minu

2 Answers

Not knowing how you made your split, I would suggest just making sure the ID column is not included in your training data. Something like this perhaps:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['ID', 'Response'])].values, df.Response)

That will split only the values from the DataFrame not in ID or Response for the X values, and split Response for the y values.

But you will still not be able to use the DecisionTreeClassifier with this data as it contains strings. You will need to convert any column with categorical data, i.e. TypeA and TypeB, to a numerical representation. The best way to do this in my opinion for sklearn is with the LabelEncoder. Using this will convert the categorical string labels ['M', 'S'] into the integer codes [0, 1], which can be used with the DecisionTreeClassifier. If you need an example take a look at Passing categorical data to sklearn decision tree.
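As a rough illustration of that encoding step, here is a minimal sketch that rebuilds the three rows from the question and applies one LabelEncoder per string column. The `encoders` dictionary is just a name chosen for this example; it keeps the fitted encoders so the codes can be turned back into labels later. Note that scikit-learn documents LabelEncoder as intended for target labels, so OrdinalEncoder or one-hot encoding may be a better fit for feature columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The three sample rows from the question.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678"],
    "LenA": [51, 46, 87],
    "TypeA": ["M", "S", "M"],
    "LenB": [101, 49, 70],
    "TypeB": ["L", "S", "M"],
    "Diff": [50, 3, 17],
    "Score": [0.2, 0.9, 0.7],
    "Response": [0, 1, 0],
})

# Fit one encoder per categorical column and keep it for later decoding.
encoders = {}
for col in ["TypeA", "TypeB"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])  # codes follow sorted label order
    encoders[col] = enc

print(df[["ID", "TypeA", "TypeB"]])
```

The codes are assigned in sorted order of the labels, so the numbers imply an ordering (L < M < S here) that may or may not be meaningful for your data.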

Update

Per your comment I now understand that you need to map back to the ID. In this case you can leverage pandas to your advantage. Set ID as the index of your data and then do the split, that way you will retain the ID value for all of your train and test data. Let's assume your data are already in a pandas dataframe.

df = df.set_index('ID')
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)
print(X_train)
         LenA TypeA  LenB TypeB  Diff  Score
ID
345-678    87     M    70     M    17    0.7
234-567    46     S    49     S     3    0.9
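To show how the ID index can carry through to the predictions, here is a minimal end-to-end sketch on the three rows from the question. It assumes one-hot encoding via `pd.get_dummies` for the string columns and fixed `random_state` values, neither of which comes from the original answer; the `predictions` name is also just for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The three sample rows from the question, with ID moved into the index.
df = pd.DataFrame({
    "ID": ["123-456", "234-567", "345-678"],
    "LenA": [51, 46, 87],
    "TypeA": ["M", "S", "M"],
    "LenB": [101, 49, 70],
    "TypeB": ["L", "S", "M"],
    "Diff": [50, 3, 17],
    "Score": [0.2, 0.9, 0.7],
    "Response": [0, 1, 0],
}).set_index("ID")

# One-hot encode the string columns so the tree only sees numbers.
X = pd.get_dummies(df.drop(columns="Response"),
                   columns=["TypeA", "TypeB"], dtype=int)
y = df["Response"]

# The split keeps the ID index on both the train and test frames.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# predict returns a plain NumPy array in the same row order as X_test,
# so reusing X_test.index attaches each prediction to its ID.
predictions = pd.Series(clf.predict(X_test), index=X_test.index,
                        name="predicted")
print(predictions)
```

Because the ID lives in the index rather than in a column, it never reaches the classifier as a feature, yet it is still available for labelling the results.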
Grr
  • How is that going to help me map the predictions to the ID? If I split the data with the train_test_split function using the above code, I still won't have ID in the dataset, will I? – Minu Apr 23 '17 at 23:11
  • Let's say I run `clf.predict(X_test)` here, are my results going to have the same index as X_test? If so, I can merge the results dataframe and X_test dataframe on index, right? – Minu May 01 '17 at 12:13
  • @Minu They will not share the exact index. For example the index of X_test in my example will be `Index(['123-456'], dtype='object', name='ID')` whereas the results of `predict` will not have an explicit index. However, the order will still be the same so you could "join" them, just not with the `pandas.DataFrame.join` method. Something like this would work: `X_test['predicted'] = results` – Grr May 01 '17 at 14:59
  • In that case, there's no necessity to set the IDs as index, correct? I can concat the X_test data and predicted results based on the order of the rows even without index being IDs. – Minu May 03 '17 at 13:53
  • Also, when I try to OneHotEncode with index = ID, I get an error: `IndexError: arrays used as indices must be of integer (or boolean) type` – Minu May 03 '17 at 14:06
  • You did not understand the question. The question is about how to join the predictions back to the id, not how to split the dataset with the id. (An example use case: when submitting to Kaggle, your submission is a set of id-prediction pairs.) – Haha TTpro Jan 19 '22 at 07:48

A pandas DataFrame keeps its row order when you transform it (except for operations such as join/merge that create or drop rows).

So, here it is step by step:

  1. Create a df_test DataFrame that has the 'id' column.
  2. Create df_test2 without the 'id' column: df_test2 = df_test.drop(["id"], axis=1)
  3. Feed df_test2 into the model for prediction: pred = model.predict(df_test2)
  4. Create df_pred_final from the 'id' column of df_test: df_pred_final = df_test[["id"]].copy()
  5. Add a 'target' column to df_pred_final; since the row order is unchanged, each id is paired with its own prediction: df_pred_final["target"] = pred
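The steps above can be sketched as a short runnable script. The training and test rows below are made up purely for illustration (they are not from the question), and the DecisionTreeClassifier stands in for whatever `model` you actually trained.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up numeric data for illustration only; 'id' and 'target' follow
# the column names used in the steps above.
df_train = pd.DataFrame({
    "id": ["a-1", "a-2", "a-3", "a-4"],
    "len_a": [51, 46, 87, 40],
    "score": [0.2, 0.9, 0.7, 0.95],
    "target": [0, 1, 0, 1],
})
# Step 1: a test frame that still carries the 'id' column.
df_test = pd.DataFrame({
    "id": ["b-1", "b-2"],
    "len_a": [60, 45],
    "score": [0.3, 0.85],
})

# Train on everything except the identifier and the label.
model = DecisionTreeClassifier(random_state=0)
model.fit(df_train.drop(columns=["id", "target"]), df_train["target"])

# Step 2: drop the identifier before predicting.
df_test2 = df_test.drop(["id"], axis=1)
# Step 3: predictions come back in the same row order as df_test2.
pred = model.predict(df_test2)
# Step 4: start the result frame from the ids (copy avoids modifying a view).
df_pred_final = df_test[["id"]].copy()
# Step 5: attach the predictions; row i of pred belongs to row i of df_test.
df_pred_final["target"] = pred

print(df_pred_final)
```

This relies only on positional order, so it works as long as nothing between steps 2 and 5 reorders, adds, or drops rows in df_test.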

Please take a look at my kaggle notebook. You might get the idea. https://www.kaggle.com/tthien/20210412-complex-drop-c10-c2

Haha TTpro