
So I have a 2D numpy array arr. It's a relatively big one: arr.shape = (2400, 60000)

What I'm currently doing is the following:

  • randomly (with replacement) select arr.shape[0] indices
  • access (row-wise) the chosen rows of arr
  • calculate the column-wise means and select the max value
  • repeat this k times

It looks something like:

import numpy as np

no_rows = arr.shape[0]
indices = np.arange(no_rows)
my_vals = []
for _ in range(no_samples):
    # bootstrap sample of row indices (with replacement)
    random_idxs = np.random.choice(indices, size=no_rows, replace=True)
    # column-wise mean of the sampled rows, then the largest of those means
    my_vals.append(
        arr[random_idxs].mean(axis=0).max()
    )

My problem is that it is very slow. With my arr size, a single loop iteration takes ~3 s. Since I want more than 1k samples, my current solution is pretty bad (1k * ~3 s -> ~1 h). I've profiled it, and the bottleneck is the row access by indices; "mean" and "max" are fast, and np.random.choice is also fine.

Do you see any room for improvement? Either a more efficient way of indexing the rows, or a faster approach that avoids it altogether?

What I tried so far:

  • numpy.take (slower)
  • numpy.ravel, something similar to:

random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True)
test = random_idxs.ravel()[arr.ravel()].reshape(arr.shape)

  • a similar approach to the current one, but without the loop: I created a 3D array and accessed the rows across an additional dimension in one go (sketched below)
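For clarity, here is a minimal sketch (my reconstruction, not the original code) of what that loop-free 3D variant might look like; note that it allocates an even larger intermediate array of shape (no_samples, no_rows, arr.shape[1]):

# hypothetical loop-free version: draw all bootstrap index sets at once
all_idxs = np.random.choice(no_rows, size=(no_samples, no_rows), replace=True)
# fancy indexing with a 2D index array yields a 3D copy, so memory grows with no_samples
my_vals = arr[all_idxs].mean(axis=1).max(axis=1)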

2 Answers


Since advanced indexing creates a copy, arr[random_idxs] allocates a huge amount of memory.

So one of the simplest ways to improve efficiency is to do things batch-wise over the columns:

BATCH = 512
# process the columns in chunks so each copy is only (no_rows, BATCH) instead of the full array
max(arr[random_idxs, i:i+BATCH].mean(axis=0).max() for i in range(0, arr.shape[1], BATCH))
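To make this concrete, here is a minimal sketch of how the batched computation might slot into the original sampling loop (an illustration added for clarity, not code from the answer; it assumes the variable names from the question, and the best BATCH value is machine-dependent, see the comment below):

no_rows = arr.shape[0]
BATCH = 512  # tune this; smaller batches mean smaller temporary copies
my_vals = []
for _ in range(no_samples):
    random_idxs = np.random.choice(no_rows, size=no_rows, replace=True)
    # the max over per-batch maxima of the column means equals the global max,
    # and each batch only copies a (no_rows, BATCH) slice instead of the full array
    my_vals.append(
        max(arr[random_idxs, i:i+BATCH].mean(axis=0).max()
            for i in range(0, arr.shape[1], BATCH))
    )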
yao99
  • Thank you - that indeed helps! I've managed to speed it up ~3.5x with the batch approach; a single iteration went from ~3 s down to ~0.79 s. I'd only add that it's worth experimenting with the batch size. I ran several tests; here are the results as (BATCH, avg_time_over_100_iterations): (16, 1.13), (32, 0.87), (64, 0.79), (128, 1.2), (256, 1.33), (512, 1.8) and (1024, 2.47). – Slaw Aug 02 '20 at 09:43

This is not a general solution to the problem, but it should make your specific case much faster. Basically, arr.mean(axis=0).max() won't change, so why not take random samples from that array?

Something like:

mean_max = arr.mean(axis=0).max()
my_vals = np.array([np.random.choice(mean_max, size=len(mean_max), replace=True) for i in range(no_samples)])

You may even be able to do: my_vals = np.random.choice(mean_max, size=(no_samples, len(mean_max)), replace=True), but I'm not sure how, if at all, that would change your statistics.

David Hoffman
  • Not sure if I follow. `arr.mean(axis=0).max()` WILL change - that's the whole point. I'm choosing random indices with replacement, so every iteration the per-column means will be different. Then I pick the highest of those column means and append it as a data point to my final list. As for your solution - it won't run. The `mean_max` you proposed is a single float: just the highest of the column-wise means of arr. So you're trying to randomly choose a value from a single available value. – Slaw Jul 31 '20 at 07:17