Background

I have a 1D NumPy array initialized with zeros:

import numpy as np
section = np.zeros(1000)

Then I have a Pandas DataFrame with start and end indices in two columns:

import pandas as pd

d = {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
     'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}

df = pd.DataFrame(data=d, columns=['start', 'end'])

For each start/end pair, I want to set the corresponding slice of the NumPy array to True.

My current solution

I can do this by applying a function to the DataFrame:

def fill_array(row):
    # fill the half-open range [start, end) for this row
    section[row.start:row.end] = True

# apply is used only for its side effect on `section`; the returned Series is discarded
df.apply(fill_array, axis=1)

I want to vectorize this operation

This works as I expect, but for the fun of it, I would like to vectorize the operation. I'm not very proficient with this, and my searching online has not put me on the right track.

I would really appreciate any suggestions on how to make this into a vector operation, if at all possible.

2 Answers

The trick in the implementation below is to put 1 at every start index and -1 at every end index of a zero-initialized int array. The real trick comes next: cumulatively summing that array gives non-zero values at exactly the positions covered by the (start, end) pairs. So the final step is to test for non-zeros to get a boolean output. This gives us two vectorized solutions, with their implementations shown below -

def filled_array(start, end, length):
    # mark interval boundaries: +1 at each start, -1 at each end
    out = np.zeros(length, dtype=int)
    np.add.at(out, start, 1)
    np.add.at(out, end, -1)
    # the running sum is positive exactly inside the covered intervals
    return out.cumsum() > 0

def filled_array_v2(start, end, length):  # using @Daniel's suggestion
    # bincount builds the same +1/-1 boundary array in one shot per side
    out = np.bincount(start, minlength=length) - np.bincount(end, minlength=length)
    return out.cumsum().astype(bool)

Sample run -

In [2]: start
Out[2]: array([ 4,  7,  5, 15])

In [3]: end
Out[3]: array([12, 12,  7, 17])

In [4]: out = filled_array(start, end, length=20)

In [7]: pd.DataFrame(out) # print as dataframe for easy verification
Out[7]: 
        0
0   False
1   False
2   False
3   False
4    True
5    True
6    True
7    True
8    True
9    True
10   True
11   True
12  False
13  False
14  False
15   True
16   True
17  False
18  False
19  False
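
Applied to the question's DataFrame, a minimal sketch (the length value here is my assumption: it must exceed the maximum end index so np.add.at stays in bounds):

length = int(df['end'].max()) + 1  # 12001 here -- assumed large enough to cover every index
section = filled_array(df['start'].values, df['end'].values, length=length)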
– Divakar
  • Minor quibble: `maxlen` should probably be `minlen`, since if your `zeros` array is too short `add.at` will fail. Also, for performance you could do `out = np.bincount(start, minlength=minlen) - np.bincount(end, minlength=minlen)` and `return out.cumsum().astype(bool)`, since integers != 0 will resolve to `True` without the comparison step – Daniel F Jul 12 '17 at 13:01
  • @DanielF Good points, thanks a lot! Edited. That name `maxlen` wasn't a correct one. Would leave that to OP to make sure the length to be mentioned as the input arg covers all indices. – Divakar Jul 12 '17 at 13:13
  • Thanks @Divakar! This is a great answer. I like the algorithm you suggested. Clever! I measured it at 245 µs, and with the suggestions from @DanielF it came down to only 150 µs per loop. – Knut Flage Henriksen Jul 12 '17 at 22:11

Vectorization

You have already done the most important vectorization by using slice assignment, but you cannot fully vectorize this with slices alone, since Python does not support "multiple slices" in a single assignment.

If you really want a vectorized version, you can create an array with all the "True" indices, like this

indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
section[indices] = True

But this will most likely be slower, since it creates a new temporary array with indices.
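
To see what np.r_ builds here, a small illustration with made-up slice bounds (not the question's data):

import numpy as np

idx = np.r_[tuple([slice(2, 5), slice(8, 10)])]  # each slice expands to a range, then all are concatenated
print(idx)  # [2 3 4 8 9]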

Removing duplicate work

With that said, you could gain some speed-up by removing duplicate work. Specifically, you can take the union of the ranges, giving you a set of disjoint intervals.

In your case, the first interval overlaps all except the last one, so your dataframe is equivalent to

d= {'start': {0: 7200, 1: 11400},
    'end': {0: 10800, 1: 12000}}

This reduces the amount of work by up to 60%! But first we need to find these merged intervals, which can be done like this:

slices = [(row.start, row.end) for row in df.itertuples()]
slices_union = []
for start, end in sorted(slices):
    # slices are half-open, so merge only when the previous interval
    # reaches the next start; extend it if it does
    if slices_union and slices_union[-1][1] >= start:
        slices_union[-1][1] = max(slices_union[-1][1], end)
    else:
        slices_union.append([start, end])

Then you can use these (hopefully far fewer) slices like this

for start, end in slices_union:
    section[start:end] = True
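
As a quick sanity check (a sketch; the array length of 12001 is my assumption, chosen to cover the question's maximum end index), the merged slices give the same result as the original apply-based fill:

import numpy as np

expected = np.zeros(12001, dtype=bool)
for row in df.itertuples():
    expected[row.start:row.end] = True

merged = np.zeros(12001, dtype=bool)
for start, end in slices_union:
    merged[start:end] = True

assert np.array_equal(merged, expected)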
– Jonas Adler
  • As it turns out, we can vectorize, just need to change the way we do it :) – Divakar Jul 12 '17 at 12:44
  • Well he can, but I doubt it's worth it. I gave a solution in the end using `np.r_`, which I guess is the easiest solution. – Jonas Adler Jul 12 '17 at 12:47
  • Apologies for trying to be correct about terminology, but a list/dataframe comprehension isn't a vectorized solution, especially when put in big headers :) – Divakar Jul 12 '17 at 12:57
  •
    Thank you for your suggestion, @JonasAdler. I had to make a small change to your code to get it to work. The original threw an `AttributeError: 'str' object has no attribute 'start'` error. It was easily resolved by using `df.itertuples()` instead of `df`. As you suspected, the vectorized version was a little slower than the iterative one, at 747 µs versus 649 µs with iteration. For comparison, my original function clocks in at 413 µs. – Knut Flage Henriksen Jul 12 '17 at 22:17
  • Updated according to your comment. Did you try the other trick? – Jonas Adler Jul 13 '17 at 13:07