
I have the following numpy array:

y =

array([[0],
       [2],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [2],
       [2],
       [1],
       [2]])

I want to generate 3 lists of non-overlapping indices of rows of y as follows:

list_1 = 70% of rows
list_2 = 15% of rows
list_3 = 15% of rows

I know how to generate a single list, e.g. list_1:

import numpy as np

list_1 = [np.random.choice(np.where(y == i)[0], size=n_1, replace=False) for i in np.unique(y)]

where n_1 is the number of rows corresponding to 70% of all rows. In the above example y has 14 rows in total, and 70% of 14 is 9.8, which rounds down to 9. Therefore n_1 would be equal to 9.
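For concreteness, a minimal sketch of that count (using int() truncation, which matches the rounding down described above):

n_1 = int(0.7 * y.shape[0])  # 0.7 * 14 = 9.8, truncated to 9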

However, I don't know how to generate the rest of the lists (list_2 and list_3) so that they do not overlap with the row indices in list_1.

ScalaBoy
  • Maybe you could create three index arrays. Use set differences to form the next index arrays. – Stefan Feb 16 '19 at 17:58
  • Just shuffle the entire array and slice the shuffled output. – Paritosh Singh Feb 16 '19 at 18:01
  • @ParitoshSingh: It is indeed a good idea. Could you please show how can I do it? – ScalaBoy Feb 16 '19 at 18:03
  • [`random.shuffle`](https://docs.python.org/3/library/random.html#random.shuffle) and https://docs.python.org/3/tutorial/introduction.html#lists – wwii Feb 16 '19 at 18:10
  • @ParitoshSingh: From the documentation of `shuffle`: "Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. This implies that most permutations of a long sequence can never be generated. For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator." – ScalaBoy Feb 16 '19 at 18:51
  • You're asking how to **split a list/dataset into training/test/holdout sets**. There are [661 existing solutions](https://stackoverflow.com/search?q=%5Bpython%5D+split+training+test+is%3Aq): do you want numpy, base python, scikit-learn or pandas? – smci Feb 16 '19 at 19:24
  • @smci: Thanks. It's useful. But I need to get indices as the result. – ScalaBoy Feb 16 '19 at 19:30
  • @ScalaBoy: you can easily get lists of indices by doing the partitioning on the indices, rather than the data. [numpy does 3-way split directly](https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test); that's an exact duplicate. Or you can do [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) (do it twice). – smci Feb 16 '19 at 19:36
  • @smci: I already found the solution that uses `np.random.permutation(y.shape[0])` as a starting point (a sketch of that approach is shown after these comments). It returns the indices I need. `sklearn.model_selection.train_test_split` is not what I need as it does not return indices. – ScalaBoy Feb 16 '19 at 19:40
  • I don't see that you need indices, can you explain your reason for not directly splitting the data? But anyway, here's a duplicate with `sklearn.model_selection.train_test_split` [Getting indices while using train test split in scikit](https://stackoverflow.com/questions/35622396/getting-indices-while-using-train-test-split-in-scikit?rq=1). That function only does 2-way splits, so you need to call it twice to get a 3-way split. – smci Feb 16 '19 at 19:43
  • @smci: The reason is that I have X and y objects. The object X has a shape (1000, 20, 120) and the object y has a shape (1000,1). Therefore I use `y` to get the indices and then I use these indices to filter both X and y in order to get X_train, X_val, X_test, y_train, y_val, y_test. – ScalaBoy Feb 16 '19 at 20:37
  • But [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) will split X,y simultaneously (or indeed any arbitrary sequence of lists, numpy arrays, scipy-sparse matrices or pandas dataframes you give it). So you shouldn't need to keep the indices. But you can also get them from the function call if you really want. – smci Feb 16 '19 at 20:52
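A minimal sketch of the permutation-based split mentioned in the comments above (non-stratified; the 70/15/15 cut points and the names perm, idx_1, idx_2, idx_3 are assumptions, not from the original post):

import numpy as np

y = np.array([0, 2, 0, 1, 0, 1, 1, 1, 0, 0, 2, 2, 1, 2]).reshape(-1, 1)

# Shuffle all row indices once, then slice the shuffled order into 70%/15%/15%.
perm = np.random.permutation(y.shape[0])
n_1 = int(0.70 * y.shape[0])  # 9 rows
n_2 = int(0.15 * y.shape[0])  # 2 rows

idx_1 = perm[:n_1]            # 70% of the row indices
idx_2 = perm[n_1:n_1 + n_2]   # next 15%
idx_3 = perm[n_1 + n_2:]      # remaining rows (roughly 15%)

Because the three slices partition a single permutation, they cannot overlap, and the same index arrays can be used to slice both X and y (e.g. X[idx_1], y[idx_1]).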

1 Answer


You have y and list1 now. Take the set difference between all row indices and the indices already drawn into list1:

l2 = list(set(range(len(y))) - set(np.concatenate(list1)))

From l2 you can run the same np.random.choice code to choose the next 15% and save it in list2, then perform

list3 = list(set(l2) - set(list2))
Amit Gupta
  • How to get `l3`? Like this? `l3 = y.symmetric_difference(np.concatenate(list1,l2))` – ScalaBoy Feb 16 '19 at 18:49
  • you can take symmetric_difference of list2 from l2, and the remaining element will be list 3 – Amit Gupta Feb 16 '19 at 18:51
  • I don't understand. I do not have `list2`. I only have `list1` as a starting point. If I create `list2` in the same way as I created `list1` and then I apply `symmetric_difference`, then I will get a smaller number of rows in `list2` which will not correspond to 15%. – ScalaBoy Feb 16 '19 at 18:53
  • Sorry, in your update you use `list2`. As I said, I only have `list1` as a starting point. The reason is explained in the above comment. If I use your approach (if I understood it correctly), I will not get 70%/15%/15%. Can you please put the complete code starting from `list1` and show how it works on my data? – ScalaBoy Feb 16 '19 at 19:18
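
Putting the pieces together, a minimal end-to-end sketch of this answer's set-difference approach on the example y. Note it draws list_1 from all rows rather than per class, so it is a simplified, non-stratified reading of the question's code; the helper names all_idx and remaining are assumptions:

import numpy as np

y = np.array([0, 2, 0, 1, 0, 1, 1, 1, 0, 0, 2, 2, 1, 2]).reshape(-1, 1)

n_total = y.shape[0]
n_1 = int(0.70 * n_total)  # 9 rows (70%)
n_2 = int(0.15 * n_total)  # 2 rows (15%)

all_idx = np.arange(n_total)

# 70%: draw n_1 row indices without replacement.
list_1 = np.random.choice(all_idx, size=n_1, replace=False)

# 15%: draw n_2 indices from whatever is left after removing list_1.
remaining = np.array(sorted(set(all_idx) - set(list_1)))
list_2 = np.random.choice(remaining, size=n_2, replace=False)

# Final ~15%: everything not in list_1 or list_2.
list_3 = np.array(sorted(set(remaining) - set(list_2)))

Each draw removes its indices from the pool, so the three lists cannot overlap, and they can be used to index both X and y.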