Given two arrays, an input array and a repeat array, I would like to get a 2D array in which each input element is repeated along a new dimension the specified number of times, with each row zero-padded to the length of the largest repeat count.

import numpy as np

to_repeat = np.array([1, 2, 3, 4, 5, 6])
repeats = np.array([1, 2, 2, 3, 3, 1])
# I want final array to look like the following:
#[[1, 0, 0],
# [2, 2, 0],
# [3, 3, 0],
# [4, 4, 4],
# [5, 5, 5],
# [6, 0, 0]]

The issue is that I'm working with large datasets (10M elements or so), so a list comprehension is too slow. What is a fast way to achieve this?

Raven

1 Answer

Here's one with masking based on this idea -

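# Boolean mask: row i is True in its first repeats[i] columns
m = repeats[:,None] > np.arange(repeats.max())
out = np.zeros(m.shape,dtype=to_repeat.dtype)
# Masked assignment fills the True slots in row-major order,
# which matches the order of the values produced by np.repeat
out[m] = np.repeat(to_repeat,repeats)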

Sample output -

In [44]: out
Out[44]: 
array([[1, 0, 0],
       [2, 2, 0],
       [3, 3, 0],
       [4, 4, 4],
       [5, 5, 5],
       [6, 0, 0]])
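
For reuse, the masking steps above could be wrapped in a small function. This is only a sketch: the name `repeat_and_pad` and the empty-input handling are my own assumptions, not part of the original snippet. The last lines check the result against a plain list comprehension on the sample data.

```python
import numpy as np

def repeat_and_pad(to_repeat, repeats):
    """Hypothetical helper: repeat each value repeats[i] times along a new
    axis and zero-pad every row to the largest repeat count."""
    to_repeat = np.asarray(to_repeat)
    repeats = np.asarray(repeats)
    if repeats.size == 0:
        # Assumption: an empty input yields an empty 2D array
        return np.zeros((0, 0), dtype=to_repeat.dtype)
    # Row i of the mask is True in its first repeats[i] columns
    mask = repeats[:, None] > np.arange(repeats.max())
    out = np.zeros(mask.shape, dtype=to_repeat.dtype)
    # True slots are filled in row-major order, matching np.repeat's output order
    out[mask] = np.repeat(to_repeat, repeats)
    return out

to_repeat = np.array([1, 2, 3, 4, 5, 6])
repeats = np.array([1, 2, 2, 3, 3, 1])
result = repeat_and_pad(to_repeat, repeats)

# Slow reference built with a list comprehension, for verification only
width = int(repeats.max())
reference = np.array([[v] * r + [0] * (width - r)
                      for v, r in zip(to_repeat.tolist(), repeats.tolist())])
assert np.array_equal(result, reference)
```

Note that the output holds `len(repeats) * repeats.max()` elements, so a single very large repeat count widens every row.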

Or with broadcasted-multiplication -

In [67]: m*to_repeat[:,None]
Out[67]: 
array([[1, 0, 0],
       [2, 2, 0],
       [3, 3, 0],
       [4, 4, 4],
       [5, 5, 5],
       [6, 0, 0]])
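
If a padding value other than zero is needed, the same mask can be fed to `np.where` in place of the multiplication. This is a variant I'm adding as a sketch, not something from the methods above; `fill_value` is a hypothetical name chosen for illustration.

```python
import numpy as np

to_repeat = np.array([1, 2, 3, 4, 5, 6])
repeats = np.array([1, 2, 2, 3, 3, 1])
fill_value = -1  # hypothetical padding value, assumed for this example

# Same mask as before: row i is True in its first repeats[i] columns
m = repeats[:, None] > np.arange(repeats.max())

# Take the broadcasted value where the mask is True, the fill value elsewhere
padded = np.where(m, to_repeat[:, None], fill_value)
print(padded)
```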

For large datasets, the numexpr module can evaluate that broadcasted multiplication across multiple cores and with less temporary memory -

In [64]: import numexpr as ne

# Re-using mask `m` from previous method
In [65]: ne.evaluate('m*R',{'m':m,'R':to_repeat[:,None]})
Out[65]: 
array([[1, 0, 0],
       [2, 2, 0],
       [3, 3, 0],
       [4, 4, 4],
       [5, 5, 5],
       [6, 0, 0]])
Divakar