88

How do I get the original indices of the data when using train_test_split()?

What I have is the following:

from sklearn.cross_validation import train_test_split
import numpy as np
data = np.reshape(np.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels
x1, x2, y1, y2 = train_test_split(data, labels, size=0.2)

But this does not give the indices of the original data. One workaround is to add the indices to the data (e.g. data = [(i, d) for i, d in enumerate(data)]), pass the result to train_test_split, and unpack it again afterwards. Is there a cleaner solution?

Dave Neeley
CentAu
  • 3
    Note also [sklearn.model_selection.ShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html) and [sklearn.model_selection.StratifiedShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html). – Jost Jan 19 '17 at 16:15

5 Answers

130

You can use pandas dataframes or series as Julien said, but if you want to restrict yourself to numpy you can pass an additional array of indices:

from sklearn.model_selection import train_test_split
import numpy as np
n_samples, n_features, n_classes = 10, 2, 2
data = np.random.randn(n_samples, n_features)  # 10 training examples
labels = np.random.randint(n_classes, size=n_samples)  # 10 labels
indices = np.arange(n_samples)
(
    data_train,
    data_test,
    labels_train,
    labels_test,
    indices_train,
    indices_test,
) = train_test_split(data, labels, indices, test_size=0.2)
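A quick way to check that the returned index arrays really point back to the original rows is to index the original arrays with them. The following is only a minimal sketch: the seeded generator and the `random_state=0` argument are additions for reproducibility and are not part of the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features, n_classes = 10, 2, 2
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
data = rng.standard_normal((n_samples, n_features))
labels = rng.integers(n_classes, size=n_samples)
indices = np.arange(n_samples)

(
    data_train,
    data_test,
    labels_train,
    labels_test,
    indices_train,
    indices_test,
) = train_test_split(data, labels, indices, test_size=0.2, random_state=0)

# Indexing the original arrays with the returned indices reproduces the splits.
assert np.array_equal(data[indices_test], data_test)
assert np.array_equal(labels[indices_train], labels_train)
print("test rows came from original positions:", sorted(indices_test.tolist()))
```

Because all three arrays are shuffled with the same permutation, `indices_train` and `indices_test` can be kept around and used later to look up any other per-sample information.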
ogrisel
  • 1
    as of NumPy v1.1, the 3rd line should be `data = np.reshape(np.random.randn(20),(10,2))`; last line should be `... train_test_split(data, labels, indices, test_size=0.2)` – pepe May 01 '16 at 23:50
  • 16
    Actually this should be the accepted response, because it does not use any additional package but sklearn. And it gives more control over what is going on than with pandas. – Irene Feb 28 '17 at 15:56
  • @ogrisel Hi I have a similar issue, can you kindly check https://stackoverflow.com/questions/48734942/how-to-maintain-the-x-axis-value-from-the-train-test-split-when-plotting-y-train –  Feb 11 '18 at 18:52
  • hey what is n_class on your third line?? number of class, what do you mean by class?? Thanks in advance. – Yun Tae Hwang Jan 28 '19 at 02:51
  • This is number of target label classes to generate random classification labels in the `labels` variable. – ogrisel Jan 30 '19 at 10:55
  • It also works if you added more than the indices. For example with token_lists that also need to be split: `x1, x2, y1, y2, idx1, idx2, t1, t2 = train_test_split(data, labels, indices, token_lists, test_size=0.2)`. – questionto42 Sep 15 '21 at 08:46
59

Scikit-learn works well with pandas, so I suggest you use it. Here's an example:

In [1]: 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels

In [2]: # Giving columns in X a name
X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0)

In [4]: X_test
Out[4]:

     Column_1    Column_2
2   -1.39       -1.86
8    0.48       -0.81
4   -0.10       -1.83

In [5]: y_test
Out[5]:

2    1
8    1
4    1
dtype: int32

You can pass a DataFrame/Series directly to scikit-learn functions and estimators, and it will work.

Let's say you wanted to do a LogisticRegression, here's how you could retrieve the coefficients in a nice way:

In [6]: 
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X_train, y_train)

# Retrieve coefficients: index is the feature name (['Column_1', 'Column_2'] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
            Coefficient
Column_1    0.076987
Column_2    -0.352463
Julien Marrec
  • 1
    Also, it seems you either have an issue with the code in your question or you might be using deprecated versions of scikit and numpy (np.randn doesn't exist on mine, and `test_size` is used and not `size` in train_test_split) – Julien Marrec Jul 20 '15 at 16:52
  • 2
    I edited my answer to show how to retrieve the coefficients with the feature names coming from the pandas dataframe. Might save you a little time in the future. – Julien Marrec Jul 20 '15 at 17:00
  • Hi @Julien Marrec I tried applying your solution but it did not work, can you kindly check it here https://stackoverflow.com/questions/48734942/how-to-maintain-the-x-axis-value-from-the-train-test-split-when-plotting-y-train –  Feb 11 '18 at 19:05
  • 4
    What if I have already split my data without first creating the indices? – Samuel Nde Mar 19 '19 at 22:54
  • 1
    How does this answer the original question? It appears you are just creating a new dataframe and applying the indexes of the original dataframe to the newly created array. When you randomize your data using train test split, you are shuffling the rows, and simply applying indexes from a prior dataframe to shuffled data doesn't allow you to access the indexes from your original data accurately. Am I missing something? – devdrc Jan 13 '20 at 21:46
  • @devdrc I'm only passing the NAMES of the columns of the original dataset to create the index of the df_coefs, cf the comment I had right above: `Retrieve coefficients: index is the feature name ([0,1] here)`. Granted, had I used actual names instead of letting it default to an integer-based index, it would have been clearer for sure. I'll update the answer. – Julien Marrec Jan 14 '20 at 08:20
  • As for why it answers the question: train_test_split, when passed dataframes/series, will return dataframes/series that carry the actual indices of the rows selected. – Julien Marrec Jan 14 '20 at 08:24
  • numpy is much faster and for this solution, you would need to convert it to pandas and back to numpy only for the split, if you want to stick with numpy further on. The most voted answer seems easier and better. – questionto42 Sep 15 '21 at 08:54
12

Here's the simplest solution (Jibwa made it seem complicated in another answer), without having to generate indices yourself - just using the ShuffleSplit object to generate 1 split.

import numpy as np 
from sklearn.model_selection import ShuffleSplit # or StratifiedShuffleSplit
sss = ShuffleSplit(n_splits=1, test_size=0.1)

data_size = 100
X = np.reshape(np.random.rand(data_size*2),(data_size,2))
y = np.random.randint(2, size=data_size)

sss.get_n_splits(X, y)
train_index, test_index = next(sss.split(X, y)) 

X_train, X_test = X[train_index], X[test_index] 
y_train, y_test = y[train_index], y[test_index]
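The comment in the import line mentions StratifiedShuffleSplit as a drop-in alternative. As a hedged sketch of that variant, the snippet below uses balanced labels built with `np.repeat` and a fixed `random_state`; both are illustrative choices, not part of the answer above.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

data_size = 100
rng = np.random.default_rng(0)  # seeded only for reproducibility
X = rng.random((data_size, 2))
y = np.repeat([0, 1], data_size // 2)  # balanced labels, chosen for illustration

# A single stratified split; split() yields (train_index, test_index) pairs.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_index, test_index = next(sss.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

# The two index arrays are disjoint and together cover every original row.
all_index = np.sort(np.concatenate([train_index, test_index]))
assert np.array_equal(all_index, np.arange(data_size))
print("class counts in the test set:", np.bincount(y_test))
```

Stratification keeps the class proportions of `y` in both parts, which can matter when the test set is small.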
Michał Gacka
2

The docs mention that train_test_split is just a convenience function built on top of ShuffleSplit.

I just rearranged some of their code to make my own example. Note that the actual solution is the middle block of code; the rest is imports and setup for a runnable example.

from sklearn.model_selection import ShuffleSplit
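# The public safe_indexing helper was removed in scikit-learn 0.24;
# _safe_indexing is its private successor and may change between releases.
from sklearn.utils import _safe_indexing as safe_indexing, indexable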
from itertools import chain
import numpy as np
X = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
y = np.random.randint(2, size=10) # 10 labels
seed = 1

cv = ShuffleSplit(random_state=seed, test_size=0.25)
arrays = indexable(X, y)
train, test = next(cv.split(X=X))
iterator = list(chain.from_iterable((
    safe_indexing(a, train),
    safe_indexing(a, test),
    train,
    test
    ) for a in arrays)
)
X_train, X_test, train_is, test_is, y_train, y_test, _, _  = iterator

print(X)
print(train_is)
print(X_train)

Now I have the actual indexes: train_is, test_is

Jibwa
2

If you are using pandas, you can get the index through the .index attribute of whichever returned object you are interested in. train_test_split carries the pandas indices over to the new dataframes.

In your code, once data is a DataFrame, you simply use x1.index; the returned index holds the labels of the corresponding rows in the original data.
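As a small illustration of this, the sketch below wraps the arrays in a DataFrame and a Series and reads the indices back after the split. The column names, the seeded generator and `random_state=0` are assumptions made for the example. It relies on the default integer index, where labels coincide with row positions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # seeded only for reproducibility
data = rng.standard_normal((10, 2))
labels = rng.integers(2, size=10)

X = pd.DataFrame(data, columns=["Column_1", "Column_2"])
y = pd.Series(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The returned objects keep the index labels of the rows they came from.
test_idx = X_test.index.to_numpy()
train_idx = X_train.index.to_numpy()

# With the default RangeIndex, labels are also positions in the original array.
assert np.array_equal(data[test_idx], X_test.to_numpy())
assert X_test.index.equals(y_test.index)
print("test rows came from original positions:", test_idx.tolist())
```

If the DataFrame has a custom (unique) index instead, `X.index.get_indexer(X_test.index)` should map the labels back to integer positions.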

ReneBt