I have two datasets, new_train_db1 (size 3000x200) and new_train_db2 (size 3000x200), and the corresponding labels train_labels (3000x1). I want to subsample new_train_db1, new_train_db2 and train_labels, keeping just 100 samples. I have the following code:

np.random.seed(0)
reduced_train_db1 = new_train_db1[np.random.randint(new_train_db1.shape[0], size=100), :]
np.random.seed(0)
reduced_train_db2 = new_train_db2[np.random.randint(new_train_db2.shape[0], size=100), :]
np.random.seed(0)
reduced_labels = train_labels[np.random.randint(train_labels.shape[0], size=100)]

What I actually want is to keep the same samples every time I run the code. How can I do so?

Clock Slave
Jose Ramon

1 Answer

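You are calling np.random.randint three times. Because you reset the seed to 0 before each call and all three arrays have 3000 rows, the three calls do return the same indices, but the repetition is unnecessary and easy to break. Why not just run it once and reuse the same indices everywhere?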

You can do this as a one-time operation: generate the random indices with numpy.random.randint, store them in a file, and add a few lines that read this file so the same rows are selected on every run. You can get the indices using:

ind = np.random.randint(3000, size=(100,))

You can save and load the array using numpy.save and numpy.load.
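The save-and-load approach might look roughly like the sketch below. The helper name load_or_create_indices, the file name subsample_indices.npy, and the random stand-in arrays are assumptions made for illustration; they are not part of the original code. In real use you would point index_path at a persistent location instead of a temporary directory.

```python
import os
import tempfile

import numpy as np


def load_or_create_indices(path, n_rows, n_samples):
    """Return saved row indices if `path` exists; otherwise draw and save them."""
    if os.path.exists(path):
        return np.load(path)
    # randint samples with replacement, so a row may be picked more than once
    ind = np.random.randint(n_rows, size=(n_samples,))
    np.save(path, ind)
    return ind


# Stand-in data with the shapes from the question (assumed, for illustration)
new_train_db1 = np.random.rand(3000, 200)
new_train_db2 = np.random.rand(3000, 200)
train_labels = np.random.rand(3000, 1)

# Hypothetical file location; a temp directory keeps the sketch self-contained
index_path = os.path.join(tempfile.mkdtemp(), "subsample_indices.npy")

ind = load_or_create_indices(index_path, new_train_db1.shape[0], 100)
reduced_train_db1 = new_train_db1[ind, :]
reduced_train_db2 = new_train_db2[ind, :]
reduced_labels = train_labels[ind]

# A second call finds the file and reads the same indices back
ind_again = load_or_create_indices(index_path, new_train_db1.shape[0], 100)
```

Since the indices come from the file after the first run, the selection no longer depends on the seed or on how many random calls happen earlier in the script.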

If you only need the same rows within a single session and don't need them to be the same across sessions, you don't have to save and load the indices at all.
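For the single-session case, a minimal sketch is to draw the indices once and index all three arrays with them. The random stand-in arrays below are assumptions that only mimic the shapes given in the question.

```python
import numpy as np

# Stand-in data with the shapes from the question (assumed, for illustration)
new_train_db1 = np.random.rand(3000, 200)
new_train_db2 = np.random.rand(3000, 200)
train_labels = np.random.rand(3000, 1)

# Draw the row indices once (with replacement, as randint does)
ind = np.random.randint(new_train_db1.shape[0], size=(100,))

# Reuse the same indices so rows and labels stay aligned
reduced_train_db1 = new_train_db1[ind, :]
reduced_train_db2 = new_train_db2[ind, :]
reduced_labels = train_labels[ind]
```

Because one index array is shared, row i of each reduced array refers to the same original sample.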

Clock Slave