Assume a NumPy array X of shape m x n and dtype float64. I need an element-wise median-of-means over the rows of X. Specifically, the m row indices are partitioned into b "buckets", each containing m/b indices. Within each bucket I compute the column-wise mean of its rows, and then I take the column-wise median across the b resulting means, which gives a vector of length n.

The following example illustrates this:

import numpy as np

m = 10
n = 10000

# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)

# Number of buckets to split rows into
b = 5

# Partition the rows of X into b buckets
row_indices = np.arange(X.shape[0])
buckets = np.array(np.array_split(row_indices, b))
X_bucketed = X[buckets, :]

# Compute the mean within each bucket
bucket_means = np.mean(X_bucketed, axis=1)

# Compute the median-of-means
median = np.median(bucket_means, axis=0)

# Edit - Method 2 (based on answer)
np.random.shuffle(row_indices)
X = X[row_indices, :]
buckets2 = np.array_split(X, b, axis=0)
bucket_means2 = [np.mean(x, axis=0) for x in buckets2]
median2 = np.median(np.array(bucket_means2), axis=0)

This program works fine if b divides m, since np.array_split() then partitions the indices into equal parts and the array buckets is 2D.

However, it does not work if b does not divide m. In that case, np.array_split() still splits into b buckets but of unequal sizes, which is fine for my purposes. For example, if b = 3 it will split the indices {0,1,...,9} into [0 1 2 3], [4 5 6] and [7 8 9]. Those arrays cannot be stacked on top of one another, so buckets is not a 2D array and cannot be used to index X.
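
To make the uneven case concrete, here is a minimal sketch (not part of the original code) that shows the bucket sizes np.array_split() produces for m = 10 and b = 3, and how the ragged list of index arrays can still be used one bucket at a time. The small matrix X here is a hypothetical stand-in chosen so the means are easy to check by hand.

```python
import numpy as np

# Split the 10 row indices into 3 buckets; the first bucket gets the extra index.
row_indices = np.arange(10)
parts = np.array_split(row_indices, 3)
sizes = [len(p) for p in parts]
print(sizes)  # [4, 3, 3]

# Hypothetical small data matrix: row i is [2*i, 2*i + 1].
X = np.arange(20.0).reshape(10, 2)

# The ragged list cannot index X in one shot, but each index array can.
bucket_means = np.stack([X[idx].mean(axis=0) for idx in parts])
print(bucket_means.shape)  # (3, 2)
```

Each bucket is indexed separately, so the buckets never need to share a common length.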

How can I make this work for unequal-sized buckets, i.e., have the program compute the mean within each bucket (irrespective of its size) and then the median across the buckets?

I do not fully understand masked arrays, and I am not sure whether they can be used here.
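
For reference, a masked-array version could be sketched roughly as follows. This is only one possible arrangement, under the assumption that every bucket is non-empty (b <= m): the index groups are padded to a common length, and the padded slots are masked so they do not contribute to the means. The names padded and mask are illustrative, not taken from the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, b = 10, 6, 3
X = rng.uniform(size=(m, n))

index_groups = np.array_split(np.arange(m), b)
max_size = max(len(g) for g in index_groups)

# Pad every index group to max_size; padded slots point at row 0 but are masked out.
padded = np.zeros((b, max_size), dtype=int)
mask = np.ones((b, max_size), dtype=bool)
for i, g in enumerate(index_groups):
    padded[i, :len(g)] = g
    mask[i, :len(g)] = False

# X[padded] has shape (b, max_size, n); repeat the mask along the column axis to match.
X_bucketed = np.ma.masked_array(X[padded], mask=np.repeat(mask[:, :, None], n, axis=2))

# The masked mean ignores the padded rows; the result has shape (b, n).
bucket_means = X_bucketed.mean(axis=1)

# No bucket is fully masked here, so filled() just returns the plain values.
median = np.median(bucket_means.filled(np.nan), axis=0)

# Cross-check against a plain per-bucket loop.
reference = np.median([X[g].mean(axis=0) for g in index_groups], axis=0)
```

The padding costs extra memory (b * max_size * n values instead of m * n), so this is mainly useful when a single vectorized expression matters more than memory.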

mgus

1 Answer

You can compute each bucket's mean separately, then stack the means and compute the median. You can also apply array_split to X directly; there is no need to index X with a split index array (maybe this was your main question?).

import numpy as np

m = 11
n = 10000

# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)

# Number of buckets to split rows into
b = 5

# Partition the rows of X into b buckets
buckets = np.array_split(X, b, axis=0)

# Compute the mean within each bucket
b_means = [np.mean(x, axis=0) for x in buckets]

# Compute the median-of-means
median = np.median(np.array(b_means), axis=0)

print(median) #(10000,) shaped array
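
If the Python-level loop over buckets ever becomes a concern, one loop-free variant can be sketched with np.add.reduceat, which sums contiguous row ranges. This is an alternative to the list comprehension above, not what the answer's code does, and it assumes b <= m so that no bucket is empty. The bucket sizes are rebuilt by hand to match what np.array_split produces (the first m % b buckets get one extra row).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, b = 11, 8, 5
X = rng.uniform(size=(m, n))

# Bucket sizes as np.array_split would produce them: here [3, 2, 2, 2, 2].
sizes = np.full(b, m // b)
sizes[: m % b] += 1

# Starting row of each bucket.
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))

# Sum the rows of each contiguous bucket, then divide by the bucket size.
bucket_sums = np.add.reduceat(X, starts, axis=0)
bucket_means = bucket_sums / sizes[:, None]

median = np.median(bucket_means, axis=0)

# Cross-check against the array_split version.
reference = np.median([x.mean(axis=0) for x in np.array_split(X, b, axis=0)], axis=0)
print(np.allclose(median, reference))
```

Since the buckets are contiguous row ranges, any shuffling of the rows has to happen before this step.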
ddoGas
  • The slicing was just because I am shuffling the rows (omitted for brevity but added now). I have adapted your idea to my code (please see edit) and it seems to work as intended. I am just wondering whether you can give me some insight on how the two methods compare in terms of speed? I'm worried about potential bottlenecks since this is supposed to be done on a large scale. – mgus Aug 06 '20 at 07:28
  • @mgus I quickly tested it at several different scales, and the second method seems to be always faster, which is not what I expected. I'm sorry to say I don't have much insight into time efficiency. Maybe this post will help: [why-isnt-numpy-mean-multithreaded](https://stackoverflow.com/questions/16617973/why-isnt-numpy-mean-multithreaded) – ddoGas Aug 07 '20 at 03:26
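
A comparison like the one described in the comments could be set up with timeit along these lines. The function names mom_index and mom_split and the array sizes are hypothetical, and the timings will depend on the machine and on m, n and b, so no numbers are claimed here. One plausible reason for a difference is that indexing with an integer array copies the selected rows, whereas np.array_split on X returns views.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
m, n, b = 200, 2000, 7  # hypothetical sizes; adjust to the real workload
X = rng.uniform(size=(m, n))

def mom_index(X, b):
    """Median-of-means via split row indices (each X[g] copies its rows)."""
    groups = np.array_split(np.arange(X.shape[0]), b)
    return np.median([X[g].mean(axis=0) for g in groups], axis=0)

def mom_split(X, b):
    """Median-of-means via array_split on X itself (each bucket is a view)."""
    return np.median([x.mean(axis=0) for x in np.array_split(X, b, axis=0)], axis=0)

for f in (mom_index, mom_split):
    # Best of 3 repeats, 5 calls each, reported per call.
    t = min(timeit.repeat(lambda: f(X, b), number=5, repeat=3)) / 5
    print(f"{f.__name__}: {t * 1e3:.2f} ms per call")
```

Both functions return the same result for unshuffled rows, so only the running time differs.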