19

Q: given an array of integers like

[1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]

I need to mask elements that repeat more than N times. The goal is to retrieve the boolean mask array.

I came up with a rather complicated solution:

import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

N = 3
splits = np.split(bins, np.where(np.diff(bins) != 0)[0]+1)
mask = []
for s in splits:
    if s.shape[0] <= N:
        mask.append(np.ones(s.shape[0]).astype(np.bool_))
    else:
        mask.append(np.append(np.ones(N), np.zeros(s.shape[0]-N)).astype(np.bool_)) 

mask = np.concatenate(mask)

giving e.g.

bins[mask]
Out[90]: array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

Is there a nicer way to do this?


Wrap-up: Here's a slim version of MSeifert's benchmark plot (thanks for pointing me to simple_benchmark). Showing the four most performant options: enter image description here

The idea proposed by Florian H, modified by Paul Panzer seems to be a great way of solving this problem as it is pretty straight forward and numpy-only. If you're fine with using numba, MSeifert's solution outperforms the other.

I chose to accept MSeifert's answer as solution as it is the more general answer: It correctly handles arbitrary arrays with (non-unique) blocks of consecutive repeating elements. In case numba is a no-go, Divakar's answer is also worth a look.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72

8 Answers8

8

Disclaimer: this is just a sounder implementation of @FlorianH's idea:

def f(a,N):
    mask = np.empty(a.size,bool)
    mask[:N] = True
    np.not_equal(a[N:],a[:-N],out=mask[N:])
    return mask

For larger arrays this makes a huge difference:

a = np.arange(1000).repeat(np.random.randint(0,10,1000))
N = 3

print(timeit(lambda:f(a,N),number=1000)*1000,"us")
# 5.443050000394578 us

# compare to
print(timeit(lambda:[True for _ in range(N)] + list(bins[:-N] != bins[N:]),number=1000)*1000,"us")
# 76.18969900067896 us
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
  • I don't think it works correctly for arbitrary arrays: For example with `[1,1,1,1,2,2,1,1,2,2]`. – MSeifert Oct 21 '19 at 20:58
  • @MSeifert From OP's example I assumed this kind of thing can't happen, but you are correct in that OP's actual code could handle your example. Well, only OP can tell, I suppose. – Paul Panzer Oct 21 '19 at 21:11
  • as I replied to user2357112's comment, in my specific case, the input is sorted and blocks of consecutive repeating elements are unique. However, from a more general perspective, it could be very useful if one could handle arbitrary arrays. – FObersteiner Oct 22 '19 at 10:23
4

Approach #1 : Here's a vectorized way -

from scipy.ndimage.morphology import binary_dilation

def keep_N_per_group(a, N):
    k = np.ones(N,dtype=bool)
    m = np.r_[True,a[:-1]!=a[1:]]
    return a[binary_dilation(m,k,origin=-(N//2))]

Sample run -

In [42]: a
Out[42]: array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

In [43]: keep_N_per_group(a, N=3)
Out[43]: array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])

Approach #2 : A bit more compact version -

def keep_N_per_group_v2(a, N):
    k = np.ones(N,dtype=bool)
    return a[binary_dilation(np.ediff1d(a,to_begin=a[0])!=0,k,origin=-(N//2))]

Approach #3 : Using the grouped-counts and np.repeat (won't give us the mask though) -

def keep_N_per_group_v3(a, N):
    m = np.r_[True,a[:-1]!=a[1:],True]
    idx = np.flatnonzero(m)
    c = np.diff(idx)
    return np.repeat(a[idx[:-1]],np.minimum(c,N))

Approach #4 : With a view-based method -

from skimage.util import view_as_windows

def keep_N_per_group_v4(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    idx = np.flatnonzero(m)
    v = idx<len(w)
    w[idx[v]] = 1
    if v.all()==0:
        m[idx[v.argmin()]:] = 1
    return a[m]

Approach #5 : With a view-based method without indices from flatnonzero -

def keep_N_per_group_v5(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    last_idx = len(a)-m[::-1].argmax()-1
    w[m[:-N+1]] = 1
    m[last_idx:last_idx+N] = 1
    return a[m]
Divakar
  • 218,885
  • 19
  • 262
  • 358
4

I want to present a solution using numba which should be fairly easy to understand. I assume that you want to "mask" consecutive repeating items:

import numpy as np
import numba as nb

@nb.njit
def mask_more_n(arr, n):
    mask = np.ones(arr.shape, np.bool_)

    current = arr[0]
    count = 0
    for idx, item in enumerate(arr):
        if item == current:
            count += 1
        else:
            current = item
            count = 1
        mask[idx] = count <= n
    return mask

For example:

>>> bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
>>> bins[mask_more_n(bins, 3)]
array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])
>>> bins[mask_more_n(bins, 2)]
array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

Performance:

Using simple_benchmark - however I haven't included all approaches. It's a log-log scale:

enter image description here

It seems like the numba solution cannot beat the solution from Paul Panzer which seems to be faster for large arrays by a bit (and doesn't require an additional dependency).

However both seem to outperform the other solutions, but they do return a mask instead of the "filtered" array.

import numpy as np
import numba as nb
from simple_benchmark import BenchmarkBuilder, MultiArgument

b = BenchmarkBuilder()

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])

@nb.njit
def mask_more_n(arr, n):
    mask = np.ones(arr.shape, np.bool_)

    current = arr[0]
    count = 0
    for idx, item in enumerate(arr):
        if item == current:
            count += 1
        else:
            current = item
            count = 1
        mask[idx] = count <= n
    return mask

@b.add_function(warmups=True)
def MSeifert(arr, n):
    return mask_more_n(arr, n)

from scipy.ndimage.morphology import binary_dilation

@b.add_function()
def Divakar_1(a, N):
    k = np.ones(N,dtype=bool)
    m = np.r_[True,a[:-1]!=a[1:]]
    return a[binary_dilation(m,k,origin=-(N//2))]

@b.add_function()
def Divakar_2(a, N):
    k = np.ones(N,dtype=bool)
    return a[binary_dilation(np.ediff1d(a,to_begin=a[0])!=0,k,origin=-(N//2))]

@b.add_function()
def Divakar_3(a, N):
    m = np.r_[True,a[:-1]!=a[1:],True]
    idx = np.flatnonzero(m)
    c = np.diff(idx)
    return np.repeat(a[idx[:-1]],np.minimum(c,N))

from skimage.util import view_as_windows

@b.add_function()
def Divakar_4(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    idx = np.flatnonzero(m)
    v = idx<len(w)
    w[idx[v]] = 1
    if v.all()==0:
        m[idx[v.argmin()]:] = 1
    return a[m]

@b.add_function()
def Divakar_5(a, N):
    m = np.r_[True,a[:-1]!=a[1:]]
    w = view_as_windows(m,N)
    last_idx = len(a)-m[::-1].argmax()-1
    w[m[:-N+1]] = 1
    m[last_idx:last_idx+N] = 1
    return a[m]

@b.add_function()
def PaulPanzer(a,N):
    mask = np.empty(a.size,bool)
    mask[:N] = True
    np.not_equal(a[N:],a[:-N],out=mask[N:])
    return mask

import random

@b.add_arguments('array size')
def argument_provider():
    for exp in range(2, 20):
        size = 2**exp
        yield size, MultiArgument([np.array([random.randint(0, 5) for _ in range(size)]), 3])

r = b.run()
import matplotlib.pyplot as plt

plt.figure(figsize=[10, 8])
r.plot()
MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • _"It seems like the numba solution cannot beat the solution from Paul Panzer"_ arguably it is faster for a decent range of sizes. And it is more powerful. I couldn't make mine (well, @FlorianH's) work for nonunique block values without making it much slower. Interestingly, even replicating Florians method with pythran (which typically performs similarly to numba) I couldn't match the numpy implementation for large arrays. pythran doesn't like the `out` argument (or perhaps the functional form of the operator), so I couldn't save that copy. B.t.w. I quite like `simple_benchmark`. – Paul Panzer Oct 21 '19 at 23:17
  • great hint there, to use `simple_benchmark`! thanks for that and thanks of course for the answer. Since I'm using `numba` for other things as well, I'm prone also use it here and make this the solution. between a rock and a hard place there... – FObersteiner Oct 22 '19 at 09:04
2

You could do this with indexing. For any N the code would be:

N = 3
bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,6])

mask = [True for _ in range(N)] + list(bins[:-N] != bins[N:])
bins[mask]

output:

array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6]
Florian H
  • 3,052
  • 2
  • 14
  • 25
1

You could use a while loop that checks if the array element N positions back is equal to the current one. Note this solution assumes the array is ordered.

import numpy as np

bins = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5]
N = 3
counter = N

while counter < len(bins):
    drop_condition = (bins[counter] == bins[counter - N])
    if drop_condition:
        bins = np.delete(bins, counter)
    else:
        # move on to next element
        counter += 1
  • You might want to change `len(question)` to `len(bins)` – Florian H Oct 21 '19 at 08:03
  • sorry if my question is unclear there; I'm not looking to remove elements, I just need a mask that I can use later on (e.g. masking a dependent variable to get equal number of samples per bin). – FObersteiner Oct 21 '19 at 08:08
1

A much nicer way would be to use numpy's unique()-function. You will get unique entries in your array and also the count of how often they appear:

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
N = 3

unique, index,count = np.unique(bins, return_index=True, return_counts=True)
mask = np.full(bins.shape, True, dtype=bool)
for i,c in zip(index,count):
    if c>N:
        mask[i+N:i+c] = False

bins[mask]

output:

array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])
AnsFourtyTwo
  • 2,480
  • 2
  • 13
  • 33
0

Solution

You could use numpy.unique. The variable final_mask can be used to extract the traget elements from the array bins.

import numpy as np

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
repeat_max = 3

unique, counts = np.unique(bins, return_counts=True)
mod_counts = np.array([x if x<=repeat_max else repeat_max for x in counts])
mask = np.arange(bins.size)
#final_values = np.hstack([bins[bins==value][:count] for value, count in zip(unique, mod_counts)])
final_mask = np.hstack([mask[bins==value][:count] for value, count in zip(unique, mod_counts)])
bins[final_mask]

Output:

array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])
CypherX
  • 7,019
  • 3
  • 25
  • 37
  • that would require an additional step to get a mask of the same shape as `bins`, right? – FObersteiner Oct 21 '19 at 08:56
  • True: **only if** you are interested in getting the mask first. If you want the **`final_values`** directly, you could **uncomment** the only commented line in the solution and in that case you could discard three lines: `mask = ...`, `final_mask = ...` and `bins[final_mask]`. – CypherX Oct 21 '19 at 09:33
0

You could use grouby to group common elements and filter list that are longer than N.

import numpy as np
from itertools import groupby, chain

def ifElse(condition, exec1, exec2):

    if condition : return exec1 
    else         : return exec2


def solve(bins, N = None):

    xss = groupby(bins)
    xss = map(lambda xs : list(xs[1]), xss)
    xss = map(lambda xs : ifElse(len(xs) > N, xs[:N], xs), xss)
    xs  = chain.from_iterable(xss)
    return list(xs)

bins = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5])
solve(bins, N = 3)