Getting indices while using train test split in scikit

Question

To split my data into train and test sets, I'm using the sklearn.cross_validation.train_test_split function.

When I supply my data and labels as a list of lists to this function, it returns the train and test data in two separate lists.

I want to get the indices of the train and test data elements from the original data list.

Can anyone help me out with this?

Thanks in advance

Also answered [here](http://stackoverflow.com/questions/31521170/scikit-learn-train-test-split-with-indices). — Jost, Jan 19 '17 at 16:09

score 46 · Answer 1 · answered Feb 25 '16 at 09:18


You can supply the index vector as an additional argument; it is split in the same way as the data and labels. Using the example from sklearn:

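import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

X, y, indices = (0.1 * np.arange(10)).reshape((5, 2)), range(10, 15), range(5)
# Every array passed in is split the same way, so the indices follow the rows
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, indices, test_size=0.33, random_state=42)
indices_train, indices_test
# ([2, 0, 3], [1, 4])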
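
As a quick sanity check, the returned indices can be used to select the same rows from the original arrays. This is only a minimal sketch on toy data; the array shapes, the names idx_train and idx_test, and the random_state value are assumptions made for illustration and are not part of the answer above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))   # toy features, 10 rows (illustrative)
y = np.arange(10) % 2                # toy labels (illustrative)
indices = np.arange(len(X))          # one index per row

X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.3, random_state=0)

# The index arrays select exactly the rows that ended up in each split
rows_match = np.array_equal(X[idx_train], X_train) and np.array_equal(X[idx_test], X_test)
labels_match = np.array_equal(y[idx_train], y_train) and np.array_equal(y[idx_test], y_test)
# Train and test indices together cover every original row exactly once
covers_all = sorted(np.concatenate([idx_train, idx_test]).tolist()) == list(range(len(X)))
print(rows_match, labels_match, covers_all)
```

If all three checks print True, the index arrays can safely be used to look up the original position of any train or test element.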


Christian Hirsch


Very simple and useful answer, bravo! – mgokhanbakal Aug 16 '20 at 11:42
Great answer, much simpler and more straightforward than many others. – Krutik Jun 15 '23 at 12:27

score 1 · Answer 2 · answered Feb 27 '21 at 15:09

Try one of the solutions below, depending on whether your classes are imbalanced:

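import numpy as np
from sklearn.model_selection import train_test_split

# train: the feature array/DataFrame; y: its labels (both assumed to be defined already)
NUM_ROWS = train.shape[0]
TEST_SIZE = 0.3
indices = np.arange(NUM_ROWS)

# usual train-val split
train_idx, val_idx = train_test_split(indices, test_size=TEST_SIZE, train_size=None)

# stratified train-val split that preserves the class proportions of y (use if the classes are imbalanced)
strat_train_idx, strat_val_idx = train_test_split(indices, test_size=TEST_SIZE, stratify=y)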
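
To see what the stratified variant does, here is a small self-contained sketch. The imbalanced label vector (80 zeros, 20 ones) and the random_state value are made up for illustration; with real data, y would be your own label array.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 zeros and 20 ones (illustrative data only)
y = np.array([0] * 80 + [1] * 20)
indices = np.arange(len(y))

# Split only the indices; stratify=y keeps the class ratio similar in both parts
train_idx, val_idx = train_test_split(
    indices, test_size=0.3, stratify=y, random_state=0)

# Share of the minority class in each split; both should be close to 0.2
train_share = y[train_idx].mean()
val_share = y[val_idx].mean()
print(len(train_idx), len(val_idx), train_share, val_share)
```

Because only the indices are split, the same train_idx and val_idx can then be used to slice the features, the labels, or any other array aligned with the original rows.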
