230

I have a very large 2D array which looks something like this:

a=
[[a1, b1, c1],
 [a2, b2, c2],
 ...,
 [an, bn, cn]]

Using numpy, is there an easy way to get a new 2D array with, e.g., 2 random rows from the initial array a (without replacement)?

e.g.

b=
[[a4,  b4,  c4],
 [a99, b99, c99]]
Nathan
  • It's silly to have one question for sampling with replacement and one without; you should just allow, and in fact encourage, both answers. – Charlie Parker Jun 19 '16 at 21:54

10 Answers

287
>>> A = np.random.randint(5, size=(10,3))
>>> A
array([[1, 3, 0],
       [3, 2, 0],
       [0, 2, 1],
       [1, 1, 4],
       [3, 2, 2],
       [0, 1, 0],
       [1, 3, 1],
       [0, 4, 1],
       [2, 4, 2],
       [3, 3, 1]])
>>> idx = np.random.randint(10, size=2)
>>> idx
array([7, 6])
>>> A[idx,:]
array([[0, 4, 1],
       [1, 3, 1]])

Putting it together for a general case:

A[np.random.randint(A.shape[0], size=2), :]

For sampling without replacement (NumPy 1.7.0+):

A[np.random.choice(A.shape[0], 2, replace=False), :]

I do not believe there is a good way to generate a random list without replacement before NumPy 1.7. Perhaps you can set up a small helper that ensures the two indices are not the same.
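
A minimal sketch of such a helper, assuming only two distinct rows are ever needed (the function name is hypothetical):

import numpy as np

def two_distinct_rows(A):
    # draw a first row index, then redraw the second until it differs
    i = np.random.randint(A.shape[0])
    j = np.random.randint(A.shape[0])
    while j == i:
        j = np.random.randint(A.shape[0])
    return A[[i, j], :]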

Daniel
  • There is maybe not a good way, but a way that is just as good as `np.random.choice`, and that is `np.random.permutation(A.shape[0])[:2]`; actually it's not great, but that is what `np.random.choice` does at this time... or, if you don't mind changing your array in-place, `np.random.shuffle`. – seberg Jan 10 '13 at 17:02
  • Before numpy 1.7, use [random](http://docs.python.org/2.7/library/random.html).sample( xrange(10), 2 ) – denis Jan 15 '13 at 12:19
  • Why are you naming your variables A and B and stuff? It makes it harder to read. – Charlie Parker Jun 19 '16 at 21:53
  • @CharlieParker Does it? Matrices are often denoted by single capital letters. – jtlz2 Jul 08 '21 at 08:45
  • Colon-slicing along the second axis is not necessary (the trailing `, :` can be omitted). – zr0gravity7 Sep 30 '21 at 00:48
77

This is an old post, but this is what works best for me:

A[np.random.choice(A.shape[0], num_rows_2_sample, replace=False)]

Change replace=False to replace=True to get the same thing, but with replacement.
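
For instance, a brief usage sketch (the array A and the sample size here are just example values):

import numpy as np

A = np.random.randint(5, size=(10, 3))        # toy array with 10 rows
num_rows_2_sample = 2
b = A[np.random.choice(A.shape[0], num_rows_2_sample, replace=False)]     # 2 distinct rows
b_repl = A[np.random.choice(A.shape[0], num_rows_2_sample, replace=True)]  # rows may repeat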

Hezi Resheff
33

Another option is to create a random mask if you just want to down-sample your data by a certain factor. Say I want to down-sample to 25% of my original data set, which is currently held in the array data_arr:

# generate random boolean mask the length of data
# use p 0.75 for False and 0.25 for True
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])

Now you can call data_arr[mask] and return ~25% of the rows, randomly sampled.
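
A self-contained sketch of the same idea, assuming a toy data_arr with 1000 rows:

import numpy

data_arr = numpy.random.random((1000, 3))   # toy data set, 1000 rows
# generate random boolean mask the length of data: ~75% False, ~25% True
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])
sampled = data_arr[mask]                    # roughly 250 rows; the exact count varies
print(sampled.shape)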

isosceleswheel
  • You may want to add ```replace = False``` if you don't want sampling with replacement. – Sarah Jul 22 '20 at 21:32
  • @Sarah Replacement is not an issue with this sampling method because a True/False value is returned for every position in `data_arr`. In my example, a random ~25% of the positions will be `True` and those positions are sampled from `data_arr`. – isosceleswheel Jul 24 '20 at 03:05
  • You are right. We don't need the ```replace=False```. And as you pointed out, the number of records sampled is only approximated and not exact. – Sarah Jul 31 '20 at 15:00
  • It's an interesting method. However, the number of sampled rows is an approximate to the desired (as stated in the answer). It may not work if you need exactly k rows sampled. – Eb Abadi Feb 23 '22 at 22:36
32

This is a similar answer to the one Hezi Resheff provided, but simplified so newer Python users understand what's going on (I noticed many new data science students fetch random samples in the weirdest ways because they don't know what they are doing in Python).

You can get a number of random indices from your array by using:

indices = np.random.choice(A.shape[0], number_of_samples, replace=False)

You can then use fancy indexing with your numpy array to get the samples at those indices:

A[indices]

This will get you the specified number of random samples from your data.
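
Putting both steps together in a runnable sketch (the array shape and sample size are just example values):

import numpy as np

A = np.random.randint(5, size=(10, 3))        # toy array with 10 rows
number_of_samples = 2
indices = np.random.choice(A.shape[0], number_of_samples, replace=False)
samples = A[indices]                          # fancy indexing returns the selected rows
print(samples.shape)                          # (2, 3)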

CB Madsen
  • Seems to be the best solution, and should be the selected answer. "*You can then use slicing*", typo: [fancy indexing](https://scipy-lectures.org/intro/numpy/array_object.html#indexing-with-an-array-of-integers). – mins Jan 16 '21 at 11:19
  • @mins "Fancy indexing" is indeed the correct terminology rather than "Slicing". I fixed this. Thank you. – CB Madsen Jan 19 '21 at 01:34
5

I see permutation has been suggested. In fact, it can be done in one line:

>>> A = np.random.randint(5, size=(10,3))
>>> np.random.permutation(A)[:2]

array([[0, 3, 0],
       [3, 1, 2]])
orli
2

If you want to generate multiple random subsets of rows, for example because you're doing RANSAC:

import numpy as np

num_pop = 10        # number of rows in the population
num_samples = 2     # number of random subsets to draw
pop_in_sample = 3   # rows per subset
rows_to_sample = np.random.random([num_pop, 5])
random_numbers = np.random.random([num_samples, num_pop])
samples = np.argsort(random_numbers, axis=1)[:, :pop_in_sample]
# will be shape [num_samples, pop_in_sample, 5]
row_subsets = rows_to_sample[samples, :]
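
Within each sample the indices are distinct, because argsort of a row of random numbers is a permutation of the row indices; a quick sanity check:

# every sample should contain pop_in_sample distinct row indices
assert all(len(set(row)) == pop_in_sample for row in samples)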
Ben
2

An alternative way of doing it is to use the choice method of the Generator class (see https://github.com/numpy/numpy/issues/10835):

import numpy as np

# generate the random array
A = np.random.randint(5, size=(10,3))

# use the choice method of the Generator class
rng = np.random.default_rng()
A_sampled = rng.choice(A, 2)

leading to sampled data such as

array([[1, 3, 2],
       [1, 2, 1]])

The running time of the approaches is also profiled, as follows:

%timeit rng.choice(A, 2)
15.1 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.random.permutation(A)[:2]
4.22 µs ± 83.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit A[np.random.randint(A.shape[0], size=2), :]
10.6 µs ± 418 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But when the array gets big, e.g. A = np.random.randint(10, size=(1000, 300)), working on the indices is the best way:

%timeit A[np.random.randint(A.shape[0], size=50), :]
17.6 µs ± 657 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit rng.choice(A, 50)
22.3 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.random.permutation(A)[:50]
143 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So the permutation method seems to be the most efficient one when the array is small, while working on the indices is the better choice when the array gets big.

Snoopy
2

One can generate a random sample from a given array with a random number generator:

rng = np.random.default_rng()
b = rng.choice(a, 2, replace=False)
print(b)
# e.g.
# [[a4,  b4,  c4],
#  [a99, b99, c99]]
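
If reproducible samples are needed, the generator can be seeded (the seed value below is arbitrary):

rng = np.random.default_rng(seed=42)   # arbitrary seed for reproducibility
b = rng.choice(a, 2, replace=False)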
Antiez
1

If you just need a random sample of the rows, then:

import random
new_array = random.sample(old_array,x)

Here x has to be an int defining the number of rows you want to randomly pick.

Ankit Agrawal
  • This only works if `old_array` is a sequence or a set, not a numpy array [link](https://docs.python.org/3/library/random.html#functions-for-sequences) – leermeester Apr 11 '18 at 07:44
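
A possible workaround sketch for that limitation, assuming old_array is a NumPy array: sample row indices with random.sample and then index into the array (the array below is only a stand-in):

import random
import numpy as np

old_array = np.random.randint(5, size=(10, 3))     # stand-in for old_array
x = 2
idx = random.sample(range(old_array.shape[0]), x)  # x distinct row indices
new_array = old_array[idx]                         # rows at those indices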
1

I am quite surprised that this much easier to read solution has not been proposed in more than 10 years:

import random

b = np.array(
    random.choices(a, k=2)
)

Edit: Ah, maybe because it was only introduced in Python 3.6, but still… (note also that random.choices samples with replacement, whereas the question asks for rows without replacement).

Skippy le Grand Gourou