How to add masking noise to numpy 2-D matrix in a vectorized manner?

Question

I have a numpy 2-D array X with shape (n_samples, n_features). I want to apply masking noise to each sample i.e. each row. Basically, for each row entry, I want to randomly select a fraction frac of the total n_features elements and set them to 0.

I have vectorized the inner part of the loop till now, but cannot get rid of the outer i loop.

My current code is given below.

def add_noise(X, frac):
    X_noise = X.copy()

    n_samples = X.shape[0]
    n_features = X.shape[1]

    for i in range(n_samples):
        mask = np.random.randint(0, n_features, int(frac * n_features))
        X_noise[i][mask] = 0

    return X_noise

An example is shown below.

test_arr = np.arange(1, 11)
test_arr = np.array([test_arr, test_arr])
print(test_arr)
print(add_noise(test_arr, 0.3))

[[ 1  2  3  4  5  6  7  8  9 10]
 [ 1  2  3  4  5  6  7  8  9 10]]
[[ 1  0  3  4  5  6  0  8  9  0]   # 0.3 * num_features = 3 random elements
 [ 0  2  3  4  5  6  7  0  0 10]]  # for each row set to 0

How do I get rid of the outer loop?

@norok2 Can you kindly elaborate? If you mean the no. of elements being set to 0, then yes. — pmcarpan, Feb 11 '19 at 14:49
No, I mean if you need different `frac` values for each `n_features` — norok2, Feb 11 '19 at 14:50
Also, do you require `frac` to be exact, or is it good if this is "on average"? Meaning, in your example, that exactly 3 items must be set to 0 (per row). — norok2, Feb 11 '19 at 15:12
@norok2 Yes. For the given example, exactly 3 items must be set to 0 per row. — pmcarpan, Feb 11 '19 at 15:14
OK. Then watch out that your code could also fail if some of the indexes get repeated in `randint()`. You probably want to use `np.random.choice()` instead. — norok2, Feb 11 '19 at 15:15

Mad Physicist · Accepted Answer · 2019-02-11T15:31:45.783

There is nothing stopping you from using np.random.randint to generate the full matrix of indices, one row of k indices per sample:

k = int(frac * n_features)
indices = np.random.randint(0, n_features, size=(n_samples, k))
X_noise[np.arange(n_samples)[:, None], indices] = 0

The index np.arange(n_samples)[:, None] makes the range broadcast to shape n_samples, k. This approach has the advantage of not requiring an intermediate step with a mask.
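Put together as a complete function, the snippet above might look like the following sketch. The function name `add_noise_with_replacement`, the optional `rng` argument, and the use of `np.random.default_rng` are choices made here for illustration, not part of the answer. Because the indices are drawn with replacement, a row can end up with fewer than k zeros.

```python
import numpy as np

def add_noise_with_replacement(X, frac, rng=None):
    """Zero out up to k = int(frac * n_features) entries per row (indices may repeat)."""
    rng = np.random.default_rng() if rng is None else rng
    X_noise = X.copy()
    n_samples, n_features = X.shape
    k = int(frac * n_features)
    # One row of k column indices per sample, drawn with replacement.
    indices = rng.integers(0, n_features, size=(n_samples, k))
    # The (n_samples, 1) row index broadcasts against the (n_samples, k) column indices.
    X_noise[np.arange(n_samples)[:, None], indices] = 0
    return X_noise

X = np.tile(np.arange(1, 11), (2, 1))
X_noise = add_noise_with_replacement(X, 0.3, rng=np.random.default_rng(0))
zeros_per_row = (X_noise == 0).sum(axis=1)  # at most 3 per row, possibly fewer
```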

There are a couple of potential problems with this approach:

k = int(frac * n_features) truncates, so it is not necessarily the closest integer to the fraction you are looking for. Something like k = round(frac * n_features) would be closer.
np.random.randint samples with replacement, so you will occasionally get repeated indices within a row, and fewer than k elements of that row will be zeroed. If you are OK with that, that's fine. If not, you can sample without replacement using np.random.choice(n_features, k, replace=False). The problem is that you would then have to loop over each row individually.
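One way to get exactly k distinct indices per row without a Python loop is to rank a matrix of independent random numbers and take the positions of the k smallest values in each row. This is a different technique from the one in the answer, sketched here under the assumption that a NumPy `Generator` is available; the name `add_noise_exact` is hypothetical.

```python
import numpy as np

def add_noise_exact(X, frac, rng=None):
    """Zero out exactly k = round(frac * n_features) distinct entries in every row."""
    rng = np.random.default_rng() if rng is None else rng
    X_noise = X.copy()
    n_samples, n_features = X.shape
    k = round(frac * n_features)
    if k == 0:
        return X_noise
    # Independent uniform scores; the columns holding the k smallest scores
    # in a row form a random k-subset of columns with no repeats.
    scores = rng.random((n_samples, n_features))
    indices = np.argpartition(scores, k - 1, axis=1)[:, :k]
    X_noise[np.arange(n_samples)[:, None], indices] = 0
    return X_noise

X = np.tile(np.arange(1, 11), (5, 1))
X_exact = add_noise_exact(X, 0.3, rng=np.random.default_rng(0))
zeros_exact = (X_exact == 0).sum(axis=1)  # exactly 3 zeros in every row
```

The cost is generating and partitioning a full (n_samples, n_features) matrix of random numbers, which is usually acceptable unless the array is very large.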

A more "honest" approach, in my opinion, would be to generate a matrix of random numbers and simply threshold them at frac, so that the overall fraction of zeroed elements approaches frac while the number of zeroed elements in each row varies randomly. The numbers could be generated with something like np.random.sample:

X_noise[np.random.sample(size=X_noise.shape) < frac] = 0
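A small self-contained sketch of the thresholding idea, using `np.random.default_rng` rather than `np.random.sample` (an assumption made here so the run is reproducible). It illustrates that the fraction of zeros is close to frac overall while the count per row is not fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
frac = 0.3

X = np.tile(np.arange(1, 11), (1000, 1))
X_noise = X.copy()
# Each element is zeroed independently with probability frac.
X_noise[rng.random(X_noise.shape) < frac] = 0

zeros_per_row = (X_noise == 0).sum(axis=1)  # varies from row to row
overall_frac = (X_noise == 0).mean()        # close to frac for large arrays
```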

Is it possible to somehow use `np.random.choice(...)` to generate a `num_samples x k` array sampling from `[0, num_features)`? Then the approach would remain almost the same. — pmcarpan, Feb 11 '19 at 15:37
@pmcarpan. At that point, you would have to do something like https://stackoverflow.com/q/47722005/2988730 — Mad Physicist, Feb 11 '19 at 15:46

score 1 · Answer 2 · edited Aug 08 '21 at 20:03

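Try creating a map of zeroes and ones, and multiply the test array with the map. Each element is kept with probability 1 - frac, so the fraction of zeros matches frac only on average:

# Boolean map: True (keep) with probability 1 - frac, False (zero out) otherwise
zero_map = np.random.rand(*test_arr.shape) >= frac
test_arr = test_arr * zero_map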

Zulfiqaar, answered Feb 11 '19 at 14:54 · edited by Sandipan Dey, Aug 08 '21 at 20:03

Or just use the mask as-is: `test_arr[~mask] = 0` – Mad Physicist Feb 11 '19 at 14:55
something like `np.round(np.random.rand(*test_arr.shape) < (1-frac))` – Sandipan Dey Aug 08 '21 at 20:04

score 0 · Answer 3 · answered Feb 11 '19 at 15:26

You can use the NumPy function apply_along_axis to shuffle each row of a boolean mask.

import numpy as np

def add_noise(X, frac):
    X_noise = X.copy()

    n_samples = X.shape[0]
    n_features = X.shape[1]
    k = int(frac * n_features)

    # Each row starts with k True values followed by n_features - k False values.
    mask = np.concatenate((np.ones((n_samples, k), dtype=bool),
                           np.zeros((n_samples, n_features - k), dtype=bool)),
                          axis=1)
    # Shuffle every row of the mask in place, independently of the others.
    np.apply_along_axis(np.random.shuffle, 1, mask)
    X_noise[mask] = 0
    return X_noise
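Since np.apply_along_axis still calls the shuffle once per row, a possible variation is to shuffle all rows of the mask in one call with `Generator.permuted`, which is available in NumPy 1.20 and later. The sketch below is an alternative to the answer's code, not the answerer's method, and the name `add_noise_permuted` is hypothetical.

```python
import numpy as np

def add_noise_permuted(X, frac, rng=None):
    """Zero out exactly k = int(frac * n_features) entries per row via a shuffled mask."""
    rng = np.random.default_rng() if rng is None else rng
    n_samples, n_features = X.shape
    k = int(frac * n_features)
    # Start with the first k columns of every row marked for masking.
    mask = np.zeros((n_samples, n_features), dtype=bool)
    mask[:, :k] = True
    # permuted shuffles each row independently along axis 1 and returns a new array.
    mask = rng.permuted(mask, axis=1)
    X_noise = X.copy()
    X_noise[mask] = 0
    return X_noise

X = np.tile(np.arange(1, 11), (4, 1))
X_perm = add_noise_permuted(X, 0.3, rng=np.random.default_rng(0))
zeros_perm = (X_perm == 0).sum(axis=1)  # exactly 3 zeros in every row
```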
