I have a dataset X such that X.shape yields (10000, 9). I want to choose a subset of X with the following code:

import numpy as np

X = np.asarray(np.random.normal(size = (10000,9)))
train_fraction = 0.7 # fraction of X that will be marked as train data
train_size = int(X.shape[0]*train_fraction) # fraction converted to number
test_size = X.shape[0] - train_size # remaining rows will be marked as test data
train_ind = np.asarray([False]*X.shape[0])     
train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True # mark True at 70% of the places

The problem is that np.sum(train_ind) is not the expected value of 7000. Instead, it gives varying values such as 5033.

I initially thought that np.random.randint(low = X.shape[0], size = (train_size,)) might be the culprit. But when I do np.random.randint(low = X.shape[0], size = (train_size,)).shape I get (7000,).

Where am I going wrong?

Jürg W. Spaak
Clock Slave
  • There are better ways to initialize a boolean NumPy array; have a look [here](https://stackoverflow.com/questions/21174961/how-to-create-a-numpy-array-of-all-true-or-all-false). I suggest the second best answer, not the accepted one. – Jürg W. Spaak Aug 10 '17 at 10:56
  • @JürgMerlinSpaak Thanks. This was helpful. – Clock Slave Aug 10 '17 at 10:57

1 Answer

Use np.random.choice(np.arange(0,X.shape[0]), size = train_size, replace = False) to generate the indices instead.

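The problem is that np.random.randint samples with replacement, so its output is not injective: the same index, say 1, may appear more than once. That index is then set to True twice, which still counts only once in the sum, while some other index is never set to True. np.sum(train_ind) therefore counts the distinct indices drawn, which for 7000 draws from 10000 possible values is about 10000*(1 - (1 - 1/10000)**7000) ≈ 5034 on average.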
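
As a quick check of this explanation, the sketch below draws indices the same way as the question and compares the number of True entries with the number of distinct indices. The names n_rows, idx, n_unique and expected_unique are introduced here for illustration only; they are not from the original code.

```python
import numpy as np

n_rows, train_size = 10000, 7000  # same sizes as in the question

# np.random.randint samples with replacement, so repeated indices are expected.
idx = np.random.randint(0, n_rows, size=train_size)
n_unique = np.unique(idx).size

train_ind = np.zeros(n_rows, dtype=bool)
train_ind[idx] = True  # setting an already-True entry to True changes nothing

# Expected number of distinct values among k draws from n values:
# n * (1 - (1 - 1/n)**k)
expected_unique = n_rows * (1 - (1 - 1 / n_rows) ** train_size)

print(train_ind.sum(), n_unique, round(expected_unique))
```

The first two printed numbers always agree, and both should land near the third, which is consistent with the value of 5033 reported in the question.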

The np.random.choice function ensures that every number occurs at most once (if you set replace = False), so exactly train_size entries are set to True.
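
Put together, the fix might look like the following minimal sketch. It reuses the variable names from the question; train_idx, X_train and X_test are names assumed here for illustration.

```python
import numpy as np

X = np.random.normal(size=(10000, 9))
train_fraction = 0.7  # fraction of X that will be marked as train data
train_size = int(X.shape[0] * train_fraction)

# Draw train_size distinct row indices; replace=False rules out repeats.
train_idx = np.random.choice(X.shape[0], size=train_size, replace=False)

# Boolean mask with exactly train_size True entries.
train_ind = np.zeros(X.shape[0], dtype=bool)
train_ind[train_idx] = True

X_train = X[train_ind]   # rows marked as train data
X_test = X[~train_ind]   # remaining rows

print(train_ind.sum(), X_train.shape, X_test.shape)
```

Because the indices are distinct, the mask sum equals train_size on every run, and the train and test rows do not overlap.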

Jürg W. Spaak
  • This works. Thanks. You mention: 'basically the number 1 might appear twice. This means that index 1 will be set to True twice'. Yes, agreed. I realised that might be the problem. But when I run the code I pasted and take the sum, I get values above 7000 as well. Maybe I should state that more explicitly in the question. Your post answers the question of why the sum could be less than 7000, but my main concern is what is going on in the cases where the sum is above 7000. – Clock Slave Aug 10 '17 at 10:51
  • Yes, I was wondering about that as well; it's kind of weird... My guess is that you ran the line `train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True` several times without resetting `train_ind`. I don't see how else this could happen. Tell me if that's not the case. – Jürg W. Spaak Aug 10 '17 at 10:53
  • I ran the following code to see if I could reproduce the case where I got the sum to be in excess of 7000. `num_cases = 0 for i in range(10000): train_ind = np.asarray([False]*X.shape[0]) train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True if sum(train_ind) > 7000: print(sum(train_ind)) num_cases+=1 print(num_cases) ` At the end of this loop I got `num_cases` to be zero. I guess I might have run the assignment line twice before initializing the `train_ind` array. Edited the question. Thanks – Clock Slave Aug 10 '17 at 11:04
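
The explanation suggested in the comments, running the assignment more than once without resetting `train_ind`, can be illustrated with the sketch below. It only shows that this scenario is able to push the sum past 7000; it does not prove that this is what happened. The names n_rows and counts are assumptions made for this example.

```python
import numpy as np

n_rows, train_size = 10000, 7000
train_ind = np.zeros(n_rows, dtype=bool)  # initialized once, never reset

counts = []
for _ in range(3):
    # Same assignment as in the question, repeated on the same array:
    # each pass can only add True entries, never remove them.
    train_ind[np.random.randint(low=n_rows, size=train_size)] = True
    counts.append(int(train_ind.sum()))

print(counts)
```

The first entry stays below 7000, while the later entries grow because every pass marks some previously untouched indices.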