3

I have some functions, part of a large analysis program, that require a boolean mask to divide array items into two groups. These functions look like this:

def process(data, a_mask):
    b_mask = -a_mask              # boolean negation: selects the "b" items
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b

Now I need to use these functions (without modifying them) on a big array that contains only items of class "a", but I would like to save RAM and not pass an all-True boolean mask. For example, I could pass a slice like slice(None, None).

The problem is that the line b_mask = -a_mask will fail if a_mask is a slice. Ideally, -a_mask would give a selection of 0 items.

I was thinking of creating a "modified" slice object that implements the __neg__() method to return a null slice (for example slice(0, 0)), but I don't know if this is possible.

Other solutions that leave the process() function unmodified but still avoid allocating an all-True boolean array are welcome as well.

user2304916

3 Answers

2

Unfortunately we can't add a __neg__() method to slice, since it cannot be subclassed. However, tuple can be subclassed, and we can use it to hold a single slice object.

This leads me to a very, very nasty hack which should just about work for you:

class NegTuple(tuple):
    def __neg__(self):
        return slice(0)  # an empty slice: selects no elements

We can create a NegTuple containing a single slice object:

nt = NegTuple((slice(None),))

This can be used as an index, and negating it yields an empty slice, resulting in a 0-length array when indexing:

import numpy as np

a = np.arange(5)
print(a[nt])
# [0 1 2 3 4]
print(a[-nt])
# []
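
Putting these pieces together, the unmodified process function can then be called with the NegTuple in place of an all-True mask. A minimal sketch, where func_a and func_b are placeholder stand-ins (they are not defined in the question):

import numpy as np

# Hypothetical placeholders for the real analysis functions
func_a = np.sum
func_b = np.sum

data = np.arange(5)
nt = NegTuple((slice(None),))

res_a, res_b = process(data, nt)   # process() is untouched
print(res_a, res_b)
# 10 0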

You would have to be very desperate to resort to something like this, though. Is it totally out of the question to modify process like this?

def process(data, a_mask=None):
    if a_mask is None:
        a_mask = slice(None)  # every element
        b_mask = slice(0)     # no elements
    else:
        b_mask = -a_mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b

This is way more explicit, and should not have any effect on its behavior for your current use cases.

ali_m
  • OP is looking for a solution that doesn't modify `process` – Peter Gibson Feb 10 '14 at 00:25
  • @PeterGibson against my better judgement I've updated my answer with such a solution – ali_m Feb 10 '14 at 01:12
  • That is truly horrendous, but well done for providing a working solution :) – Peter Gibson Feb 10 '14 at 01:17
  • I agree that is not pretty, but the `process` function is actually quite complex, and a big part of the analysis relies on its correctness. I'd rather do a hack for this specific use case than risk messing with the other (already well-tested) use cases. Thanks. – user2304916 Feb 10 '14 at 17:32
  • Suit yourself, but if I were you I would be more concerned with something potentially going wrong with the hack. It doesn't really matter how complicated the guts of your `process` function are - in the end, all the changes will amount to is a keyword argument and 5 or 6 new lines to handle the conditional. – ali_m Feb 10 '14 at 17:37
0

Your solution is very similar to a degenerate sparse boolean array, although I don't know of any existing implementations. My knee-jerk reaction is one of dislike, but if you really can't modify process, it's probably the best way.
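
For illustration only, here is a rough sketch of what such a degenerate, constant-value mask could look like, built on the same tuple trick as in ali_m's answer (the name ConstantMask is made up; nothing like it exists in numpy/scipy):

import numpy as np

class ConstantMask(tuple):
    """An all-True or all-False 'mask' that costs O(1) memory.

    It indexes like a basic slice (so data[mask] is a view, not a copy)
    and negation flips it, mirroring b_mask = -a_mask in process().
    """
    def __new__(cls, value=True):
        # All-True selects every element, all-False selects none
        sl = slice(None) if value else slice(0)
        obj = tuple.__new__(cls, (sl,))
        obj.value = value
        return obj

    def __neg__(self):
        return ConstantMask(not self.value)

data = np.arange(5)
a_mask = ConstantMask(True)
print(data[a_mask], data[-a_mask])
# [0 1 2 3 4] []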

U2EF1
  • What is a degenerate sparse boolean array, and how would it allow avoiding the creation of an all-True boolean array? – user2304916 Feb 10 '14 at 00:49
  • @user2304916 A sparse matrix compresses certain values (usually 0). A boolean sparse matrix could easily be written to compress whichever value (True or False) constituted the majority. Yours is degenerate because it only works for matrices that are all zeros or all ones. – U2EF1 Feb 10 '14 at 01:38
  • Yes, a boolean sparse matrix that can compress either True or False would do the job. Would be nice to have such a beast in scipy/numpy IMHO. – user2304916 Feb 10 '14 at 17:35
0

If you are concerned about memory use, then advanced indexing may be a bad idea. From the docs:

Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
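
A quick way to see the difference is to check whether the indexed result shares memory with the original array (np.shares_memory, available in recent numpy versions, is used here purely for the demonstration):

import numpy as np

data = np.arange(10)
mask = np.ones(10, dtype=bool)

view = data[2:5]     # basic slicing: a view, no copy of the data
copy = data[mask]    # advanced (boolean) indexing: a full copy

print(np.shares_memory(data, view))   # True
print(np.shares_memory(data, copy))   # False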

As it stands, the process function has:

  • data of size n say
  • a_mask of size n (assuming advanced indexing)

And creates:

  • b_mask of size n
  • data[a_mask] of size m say
  • data[b_mask] of size n - m

This is effectively 4 arrays of size n.

Basic slicing seems to be your best option then; however, Python doesn't allow subclassing slice:

TypeError: Error when calling the metaclass bases
    type 'slice' is not an acceptable base type
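
For reference, that error comes from a direct subclassing attempt such as the following (the name NegSlice is just illustrative):

class NegSlice(slice):      # raises TypeError at class-definition time
    def __neg__(self):
        return slice(0)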

See @ali_m's answer for a solution that incorporates slicing.

Alternatively, you could just bypass process and get your results as:

result = func_a(data), func_b([])
Peter Gibson
  • True, but sometimes fancy indexing is the only way to go (if your target elements are irregularly spaced, etc.) – ali_m Feb 10 '14 at 01:18
  • Unfortunately I can't use basic slicing here since the elements of the two groups are disordered. Thanks for the insight on the memory usage of using a boolean mask though. – user2304916 Feb 10 '14 at 17:29
  • @user2304916 yes, for general use I can see the boolean mask is necessary. I was referring specifically to the case in your question where slicing is appropriate. – Peter Gibson Feb 11 '14 at 00:07