I want to split my data into train and test sets, along with a vector that contains names (it serves as an index and reference).

name_images has a shape of (2440,)

My data are:

data has a shape of (2440, 3072) 
labels has a shape of (2440,)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3)

But I also want to split name_images into name_images_train and name_images_test, following the same split as data and labels.

I tried:

x_train, x_test, y_train, y_test, name_images_train, name_images_test = train_test_split(data, labels, name_images, test_size=0.3)

It doesn't preserve the order. Any suggestions? Thank you.

EDIT1:

x_train, x_test, y_train, y_test= train_test_split(data, labels,test_size=0.3, random_state=42)

name_images_train, name_images_test=train_test_split(name_images, 
                                                         test_size=0.3, 
                                                         random_state=42)

EDIT1 doesn't preserve the order either.

vincent
  • I am not understanding. You want to preserve order each time you call this `train_test_split`, or do you want to preserve the order of splitting of `data`, `labels` and `name_images` during the same call to `train_test_split`? – Vivek Kumar Apr 06 '17 at 11:35
  • I want to preserve the order of splitting of data, labels and name_images during the same call – vincent Apr 06 '17 at 12:22
  • That is what my answer does. It means that if train_X gets indices [1,5,7..], then train_y and name_images_train will get the same indices. If that still doesn't fit your need, can you give an example of the output you want? – Vivek Kumar Apr 06 '17 at 12:56

2 Answers

There are multiple ways to accomplish this.

The most straightforward is to use the random_state parameter of train_test_split. As the documentation states:

random_state : int or RandomState :-
Pseudo-random number generator state used for random sampling.

When you fix random_state, the indices generated for splitting the arrays into train and test are exactly the same on every call. Within a single call, all arrays passed in together are split with the same indices.

So change your code to:

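from sklearn.model_selection import train_test_split

# Parentheses let the unpacking targets span several lines.
(x_train, x_test,
 y_train, y_test,
 name_images_train, name_images_test) = train_test_split(data, labels, name_images,
                                                         test_size=0.3,
                                                         random_state=42)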
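
To check that the three arrays really do stay aligned, here is a minimal sketch on made-up data. The sample count, the img_i names and the trick of filling row i of data with the value i are assumptions for illustration only, not the asker's real arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 20  # small stand-in for the 2440 samples in the question

# Row i of data is filled with the value i, so every row reveals its original index.
data = np.repeat(np.arange(n_samples)[:, None], 4, axis=1)
labels = np.arange(n_samples) % 3
name_images = np.array([f"img_{i}" for i in range(n_samples)])  # hypothetical names

(x_train, x_test,
 y_train, y_test,
 name_images_train, name_images_test) = train_test_split(
    data, labels, name_images, test_size=0.3, random_state=42)

# Recover the original index of each training row and compare the other arrays against it.
train_idx = x_train[:, 0]
aligned = (np.array_equal(y_train, labels[train_idx])
           and np.array_equal(name_images_train, name_images[train_idx]))
print("train arrays aligned:", aligned)
```

If this check passes on your own arrays but the results still look wrong, the inputs were probably misaligned before the split.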

For more understanding on random_state, see my answer here:

Vivek Kumar
  • I also tried the following, but it doesn't work: x_train, x_test, y_train, y_test, name_images_train, name_images_test = train_test_split(data_pixels, classes, images_names, test_size=0.3, random_state=42) – vincent Apr 06 '17 at 12:33

In my case, I realized that my input arrays were not in the proper order in the first place. So for future Googlers: you may want to double-check whether (data, labels) are in the same order to begin with.
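
One way to make this easier to inspect, sketched under assumptions (toy arrays, hypothetical img_i names): split an index array instead of the data, then use the same indices on every array. The length check only catches arrays of different sizes. It cannot prove the rows are in matching order, so that still has to be verified against wherever the arrays came from.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 10  # toy size for illustration
data = np.arange(n_samples * 2).reshape(n_samples, 2)
labels = np.arange(n_samples) % 2
name_images = np.array([f"img_{i}" for i in range(n_samples)])  # hypothetical names

# Necessary (not sufficient) sanity check: same number of samples everywhere.
assert len(data) == len(labels) == len(name_images)

# Split the positions once, then index every array with the same positions.
indices = np.arange(n_samples)
train_idx, test_idx = train_test_split(indices, test_size=0.3, random_state=0)

x_train, x_test = data[train_idx], data[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]
name_images_train, name_images_test = name_images[train_idx], name_images[test_idx]

# Keeping train_idx / test_idx around lets you trace any sample back to its source row.
print(sorted(train_idx), sorted(test_idx))
```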

tash