Saving order of splitting with a vector of index

Question

I want to split my data into train and test sets, together with a vector of names that serves as an index and reference.

name_images has a shape of (2440,)

My data are :

data has a shape of (2440, 3072) 
labels has a shape of (2440,)

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3)

But I also want to split name_images into name_images_train and name_images_test so that they follow the same split as data and labels.

I tried

x_train, x_test, y_train, y_test, name_images_train, name_images_test = train_test_split(data, labels, name_images, test_size=0.3)

but it doesn't preserve the order. Any suggestions? Thank you.

EDIT1:

x_train, x_test, y_train, y_test= train_test_split(data, labels,test_size=0.3, random_state=42)

name_images_train, name_images_test=train_test_split(name_images, 
                                                         test_size=0.3, 
                                                         random_state=42)

EDIT1 doesn't preserve the order either.

I am not understanding. You want to preserve order each time you call this `train_test_split`, or do you want to preserve the order of splitting of `data`, `labels` and `name_images` during the same call to `train_test_split`? — Vivek Kumar, Apr 06 '17 at 11:35
I want to preserve the order of splitting of data, labels and name_images during the same call — vincent, Apr 06 '17 at 12:22
That is what my answer does. It means that if train_X gets indices [1,5,7..], then train_y and name_images_train will get the same indices. If that still doesn't fit your need, can you give an example of the output you want? — Vivek Kumar, Apr 06 '17 at 12:56

score 0 · Accepted Answer · edited May 23 '17 at 12:02

There are multiple ways to accomplish this.

The most straightforward is to use the random_state parameter of train_test_split. As the documentation states:

random_state : int or RandomState :-
Pseudo-random number generator state used for random sampling.

When you fix random_state, the indices generated for splitting the arrays into train and test are exactly the same each time, provided the arrays have the same length and test_size is unchanged.
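A minimal sketch of that behaviour, assuming a recent scikit-learn where train_test_split lives in sklearn.model_selection. The 20-element `indices` array and the variable names are made up for illustration; they are not from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the row positions of a 20-sample dataset.
indices = np.arange(20)

# Two independent calls with the same random_state and test_size.
first_train, first_test = train_test_split(indices, test_size=0.3, random_state=42)
second_train, second_test = train_test_split(indices, test_size=0.3, random_state=42)

# Both calls should select exactly the same rows, in the same order.
same_split = (np.array_equal(first_train, second_train)
              and np.array_equal(first_test, second_test))
print(same_split)
```

If `same_split` comes out False on your own arrays, the usual suspects are arrays of different lengths or a different test_size between the calls.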

So change your code to:

# Parentheses let the unpacking target span several lines
(x_train, x_test,
 y_train, y_test,
 name_images_train, name_images_test) = train_test_split(data, labels, name_images,
                                                         test_size=0.3,
                                                         random_state=42)
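To convince yourself that a single call keeps the three outputs aligned, here is a small sketch on toy arrays. The toy data, the `img_<i>.png` naming scheme and the helper `rows_aligned` are assumptions made for this check, not part of the question's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 20  # small hypothetical dataset instead of the 2440 rows in the question

# Row i of every array encodes i, so alignment after the split is easy to verify.
data = np.arange(n_samples).reshape(-1, 1) * np.ones((1, 4))
labels = np.arange(n_samples)
name_images = np.array(["img_%d.png" % i for i in range(n_samples)])

(x_train, x_test,
 y_train, y_test,
 name_images_train, name_images_test) = train_test_split(
    data, labels, name_images, test_size=0.3, random_state=42)

def rows_aligned(x, y, names):
    """Return True if row k of x, y and names all refer to the same original sample."""
    return all(
        x[k, 0] == y[k] and names[k] == "img_%d.png" % y[k]
        for k in range(len(y))
    )

train_aligned = rows_aligned(x_train, y_train, name_images_train)
test_aligned = rows_aligned(x_test, y_test, name_images_test)
print(train_aligned, test_aligned)
```

The rows are shuffled relative to the input, but row k of each train (or test) output still comes from the same original sample.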

For more on how random_state works, see my answer here:

https://stackoverflow.com/a/42197534/3374996

I also tried the following, but it doesn't work: x_train, x_test, y_train, y_test, name_images_train, name_images_test=train_test_split(data_pixels, classes, images_names, test_size=0.3, random_state=42) — vincent, Apr 06 '17 at 12:33

score 0 · Answer 2 · answered Dec 15 '21 at 16:05


In my case, I realized that my input arrays were not in the proper order in the first place. So for future Googlers: you may want to double-check whether data, labels (and any name vector) are in the same order before splitting.
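One way to make that kind of double-check easier is to split an array of row positions and index every array with it, so each output row can be traced back to its source row. This is an alternative sketch rather than what either answer above does; the toy arrays and names such as `idx_train` and `spot_check_ok` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 20  # hypothetical size; the question has 2440 rows

# Toy inputs standing in for the question's data, labels and name_images.
data = np.random.RandomState(0).rand(n_samples, 4)
labels = np.arange(n_samples) % 2
name_images = np.array(["img_%d.png" % i for i in range(n_samples)])

# Split the row positions once, then reuse them for every array.
idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=0.3, random_state=42)

x_train, x_test = data[idx_train], data[idx_test]
y_train, y_test = labels[idx_train], labels[idx_test]
name_images_train, name_images_test = name_images[idx_train], name_images[idx_test]

# Spot check: the first test row maps back to a known row of the original arrays.
original_row = idx_test[0]
spot_check_ok = bool(
    np.array_equal(x_test[0], data[original_row])
    and y_test[0] == labels[original_row]
    and name_images_test[0] == name_images[original_row]
)
print(spot_check_ok)
```

Keeping `idx_train` and `idx_test` around also lets you look up the original image for any misclassified test sample later. Note that this only guarantees a consistent split; if the inputs were misaligned to begin with, they stay misaligned.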

tash
