
So I have a 2D numpy array arr. It's a relatively big one: arr.shape = (2400, 60000)

What I'm currently doing is the following:

  • randomly (with replacement) select arr.shape[0] indices
  • access (row-wise) the chosen rows of arr
  • calculate the column-wise means and select the max value
  • repeat this k times

It looks something like:

import numpy as np

no_rows = arr.shape[0]
indices = np.arange(no_rows)
my_vals = []
for _ in range(no_samples):
    # bootstrap sample of row indices (with replacement)
    random_idxs = np.random.choice(indices, size=no_rows, replace=True)
    # column-wise mean of the sampled rows, then the largest of those means
    my_vals.append(
        arr[random_idxs].mean(axis=0).max()
    )

My problem is that it is very slow. With my arr size, a single loop iteration takes ~3 s. Since I want more than 1k samples, my current solution is pretty bad (1k * ~3 s -> ~1 h). I've profiled it, and the bottleneck is the row access by indices; "mean" and "max" are fast, and np.random.choice is also fine.

Do you see any room for improvement? Either a more efficient way of indexing the rows, or a faster approach that avoids it altogether?

What I tried so far:

  • numpy.take (slower)
  • numpy.ravel, something similar to:

random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True)
test = random_idxs.ravel()[arr.ravel()].reshape(arr.shape)

  • a similar approach to the current one, but without the loop: I created a 3D array and accessed the rows across an additional dimension in one go (sketched below)
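For clarity, here is a minimal sketch (my reconstruction, not the original code) of what that loop-free 3D variant might look like; note that it allocates an even larger intermediate array of shape (no_samples, no_rows, arr.shape[1]):

# hypothetical loop-free version: draw all bootstrap index sets at once
all_idxs = np.random.choice(no_rows, size=(no_samples, no_rows), replace=True)
# fancy indexing with a 2D index array yields a 3D copy, so memory grows with no_samples
my_vals = arr[all_idxs].mean(axis=1).max(axis=1)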

2 Answers


Since advanced indexing creates a copy, arr[random_idxs] allocates a huge amount of memory.

So one of the simplest ways to improve efficiency is to do things batch-wise over the columns:

BATCH = 512
# process the columns in chunks so each copy is only (no_rows, BATCH) instead of the full array
max(arr[random_idxs, i:i+BATCH].mean(axis=0).max() for i in range(0, arr.shape[1], BATCH))
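To make this concrete, here is a minimal sketch of how the batched computation might slot into the original sampling loop (an illustration added for clarity, not code from the answer; it assumes the variable names from the question, and the best BATCH value is machine-dependent, see the comment below):

no_rows = arr.shape[0]
BATCH = 512  # tune this; smaller batches mean smaller temporary copies
my_vals = []
for _ in range(no_samples):
    random_idxs = np.random.choice(no_rows, size=no_rows, replace=True)
    # the max over per-batch maxima of the column means equals the global max,
    # and each batch only copies a (no_rows, BATCH) slice instead of the full array
    my_vals.append(
        max(arr[random_idxs, i:i+BATCH].mean(axis=0).max()
            for i in range(0, arr.shape[1], BATCH))
    )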
yao99
  • Thank you - that indeed helps! I've managed to speed it up ~3.5x with the batch approach; a single iteration went from ~3 s down to ~0.79 s. I'd only add that it's worth experimenting with the batch size. I ran several tests; here are the results as (BATCH, avg_time_over_100_iterations): (16, 1.13), (32, 0.87), (64, 0.79), (128, 1.2), (256, 1.33), (512, 1.8) and (1024, 2.47). – Slaw Aug 02 '20 at 09:43

This is not a general solution to the problem, but it should make your specific case much faster. Basically, arr.mean(axis=0).max() won't change, so why not take random samples from that array?

Something like:

mean_max = arr.mean(axis=0).max()
my_vals = np.array([np.random.choice(mean_max, size=len(mean_max), replace=True) for i in range(no_samples)])

You may even be able to do: my_vals = np.random.choice(mean_max, size=(no_samples, len(mean_max)), replace=True), but I'm not sure how, if at all, that would change your statistics.

David Hoffman
  • Not sure if I follow. `arr.mean(axis=0).max()` WILL change - that's the whole point. I'm choosing random indices with replacement, so every iteration the per-column means will be different. Then I pick the highest of those column means and append it as a data point to my final list. As for your solution - it won't run. The `mean_max` you proposed is a single float: just the highest of the column-wise means of arr. So you're trying to randomly choose a value from a single available value. – Slaw Jul 31 '20 at 07:17