1

I have a NumPy 2-D array X with shape (n_samples, n_features). I want to apply masking noise to each sample, i.e. each row. Basically, for each row, I want to randomly select a fraction frac of the n_features elements and set them to 0.

So far I have vectorized the inner part of the loop, but I cannot get rid of the outer loop over i.

My current code is given below.

import numpy as np

def add_noise(X, frac):
    X_noise = X.copy()

    n_samples = X.shape[0]
    n_features = X.shape[1]

    for i in range(n_samples):
        mask = np.random.randint(0, n_features, int(frac * n_features))
        X_noise[i][mask] = 0

    return X_noise

An example is shown below.

test_arr = np.arange(1, 11)
test_arr = np.array([test_arr, test_arr])
print(test_arr)
print(add_noise(test_arr, 0.3))

[[ 1  2  3  4  5  6  7  8  9 10]
 [ 1  2  3  4  5  6  7  8  9 10]]
[[ 1  0  3  4  5  6  0  8  9  0]   # 0.3 * num_features = 3 random elements
 [ 0  2  3  4  5  6  7  0  0 10]]  # for each row set to 0

How do I get rid of the outer loop?

Mad Physicist
pmcarpan

3 Answers

2

There is nothing stopping you from using np.random.randint to generate the full matrix of indices, one row of k indices per sample:

k = int(frac * n_features)
indices = np.random.randint(0, n_features, size=(n_samples, k))
X_noise[np.arange(n_samples)[:, None], indices] = 0

The index np.arange(n_samples)[:, None] makes the range broadcast to shape n_samples, k. This approach has the advantage of not requiring an intermediate step with a mask.
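Put together as a complete function, the snippet above might look like the following minimal sketch. The name add_noise_vectorized is hypothetical, and the sketch keeps the int(...) truncation and the with-replacement sampling from the question, so a row can end up with fewer than k zeros.

```python
import numpy as np

def add_noise_vectorized(X, frac):
    """Zero out up to int(frac * n_features) random entries in every row."""
    X_noise = X.copy()
    n_samples, n_features = X.shape
    k = int(frac * n_features)

    # (n_samples, k) matrix of column indices, drawn with replacement
    indices = np.random.randint(0, n_features, size=(n_samples, k))

    # The (n_samples, 1) column of row indices broadcasts against the
    # (n_samples, k) column indices, pairing each row with its own columns
    X_noise[np.arange(n_samples)[:, None], indices] = 0
    return X_noise

X = np.tile(np.arange(1, 11), (2, 1))
X_noise = add_noise_vectorized(X, 0.3)
print(X_noise)
```

Because the indices can repeat within a row, the number of zeros per row here is anywhere from 1 to 3 rather than exactly 3.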

There are a couple of potential problems with this approach:

  1. k = int(frac * n_features) truncates, so it is not necessarily the closest integer to the fraction you are looking for. Something like k = round(frac * n_features) would be closer.
  2. np.random.randint samples with replacement, so you will occasionally get collisions within a row, and that row will end up with fewer than k zeros. If you are OK with that, that's fine. If not, you can sample without replacement using np.random.choice(n_features, size=k, replace=False). The problem is that you would then have to loop over each row individually.
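One common way to sample without replacement for all rows at once, without a Python loop, is to argsort a matrix of random numbers and keep the first k columns. This is a sketch of that trick, not necessarily the method the author has in mind; the name add_noise_no_replacement is hypothetical, and the sort costs O(n_features log n_features) per row.

```python
import numpy as np

def add_noise_no_replacement(X, frac):
    """Zero out exactly round(frac * n_features) distinct entries per row."""
    X_noise = X.copy()
    n_samples, n_features = X.shape
    k = round(frac * n_features)

    # Argsorting i.i.d. uniform noise along axis 1 yields an independent
    # random permutation of the column indices for every row
    noise = np.random.random_sample((n_samples, n_features))
    perm = np.argsort(noise, axis=1)

    # The first k entries of each permutation are k distinct columns
    indices = perm[:, :k]
    X_noise[np.arange(n_samples)[:, None], indices] = 0
    return X_noise

X = np.tile(np.arange(1, 11), (2, 1))
out = add_noise_no_replacement(X, 0.3)
print(out)
```

With n_features = 10 and frac = 0.3, every row of the result has exactly three zeros.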

A more "honest" approach, in my opinion, would be to generate a sequence of random numbers, and simply threshold them at frac, so that your overall noise approached frac, but the noise in each row would be random. The numbers could be generated with something like np.random.sample:

X_noise[np.random.sample(size=X_noise.shape) < frac] = 0
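
To see what the thresholding approach does in practice, the following sketch applies it to a larger array of ones and measures the result. The array shape and the fixed seed are assumptions chosen only for illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so that the sketch is reproducible

frac = 0.3
X = np.ones((1000, 100))
X_noise = X.copy()

# Each element is dropped independently with probability frac
drop = np.random.sample(size=X_noise.shape) < frac
X_noise[drop] = 0

overall = drop.mean()        # fraction of zeroed elements over the whole array
per_row = drop.sum(axis=1)   # number of zeroed elements in each row
print(overall, per_row.min(), per_row.max())
```

The overall fraction comes out close to frac, while the count in any individual row varies around frac * n_features instead of being fixed.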
Mad Physicist
  • Is it possible to somehow use `np.random.choice(...)` to generate a `num_samples x k` array sampling from `[0, num_features)`? Then the approach would remain almost the same. – pmcarpan Feb 11 '19 at 15:37
  • @pmcarpan. At that point, you would have to do something like https://stackoverflow.com/q/47722005/2988730 – Mad Physicist Feb 11 '19 at 15:46
1

Try creating a map of zeros and ones in which each entry is zero with probability frac, and multiply the test array by the map:

zero_map = np.random.rand(*test_arr.shape) >= frac
test_arr = test_arr * zero_map
Sandipan Dey
Zulfiqaar
0

You can use the NumPy function np.apply_along_axis to shuffle each row of a boolean mask independently.

def add_noise(X, frac):
    X_noise = X.copy()

    n_samples, n_features = X.shape
    k = int(frac * n_features)

    # Start with k True entries at the front of every row
    # (np.bool was removed from NumPy; the built-in bool is the dtype to use)
    mask = np.zeros((n_samples, n_features), dtype=bool)
    mask[:, :k] = True
    # Shuffle each row of the mask in place, independently of the others
    np.apply_along_axis(np.random.shuffle, 1, mask)
    X_noise[mask] = 0
    return X_noise
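
The same mask-shuffling idea can also be written without apply_along_axis by using Generator.permuted, which shuffles each slice along an axis independently. This is a sketch assuming NumPy 1.20 or later; the function name add_noise_permuted and the seed parameter are hypothetical.

```python
import numpy as np

def add_noise_permuted(X, frac, seed=None):
    """Zero out exactly int(frac * n_features) entries in every row."""
    rng = np.random.default_rng(seed)
    X_noise = X.copy()
    n_samples, n_features = X.shape
    k = int(frac * n_features)

    # k True entries at the start of every row
    mask = np.zeros((n_samples, n_features), dtype=bool)
    mask[:, :k] = True

    # permuted returns a shuffled copy; each row is shuffled independently
    mask = rng.permuted(mask, axis=1)
    X_noise[mask] = 0
    return X_noise

X = np.tile(np.arange(1, 11), (2, 1))
result = add_noise_permuted(X, 0.3, seed=0)
print(result)
```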
Yohai Magan