Approach #1: One of the NumPy ways would be -
a = df.A.values
unq, c = np.unique(a, return_counts=1)
df_out = df[np.in1d(a,unq[c>=4])]
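To make this concrete, here's a minimal sketch of Approach #1 on a small made-up dataframe (the column values and the threshold of 4 are just for illustration) -

import numpy as np
import pandas as pd

# assumed sample: the value 2 occurs 4 times, everything else fewer times
df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 2, 3]})

a = df.A.values
unq, c = np.unique(a, return_counts=True)
df_out = df[np.in1d(a, unq[c >= 4])]   # keeps only the rows with A == 2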
Approach #2: NumPy + Pandas mix one-liner for non-negative integers -
df[df.A.isin(np.flatnonzero(np.bincount(df.A)>=4))]
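Just to unpack that one-liner (a sketch on an assumed sample dataframe; np.bincount is what restricts this to non-negative integers) -

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1, 1, 1, 2]})   # assumed sample

counts = np.bincount(df.A)                 # counts[v] = number of rows with A == v
keep_vals = np.flatnonzero(counts >= 4)    # values that occur at least 4 times
out = df[df.A.isin(keep_vals)]             # same result as the one-liner above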
Approach #3: Using the fact that the dataframe is sorted on the relevant column, here's a deeper NumPy approach -
def filter_df(df, N=4):
    a = df.A.values
    mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]
    valid_mask = count >= N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    out_df = pd.DataFrame(out_arr)
    return out_df
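To see what the intermediate arrays look like, here's a small walkthrough on an assumed sorted input (the values are chosen just for illustration) -

import numpy as np

a = np.array([1, 1, 2, 2, 2, 2, 3])                   # sorted column values
mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
idx = np.flatnonzero(mask)                            # group boundaries          : [0 2 6 7]
count = idx[1:] - idx[:-1]                            # group sizes               : [2 4 1]
valid_mask = count >= 4                               # [False  True False]
good_idx = idx[:-1][valid_mask]                       # start of each valid group : [2]
out_arr = np.repeat(a[good_idx], count[valid_mask])   # [2 2 2 2]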
Benchmarking
@piRSquared has covered extensive benchmarking for all the approaches posted thus far, and pir1_5 and div3 appear to be the top two there. But the timings seem comparable, which prompted me to take a closer look. That benchmarking used timeit(stmt, setp, number=10), i.e. a constant number of iterations for running timeit, which isn't the most reliable method for timing, especially for small datasets. Also, the datasets seemed small there, as the timings for the biggest dataset were in microseconds. So, to mitigate those two issues, I am proposing to use IPython's %timeit, which automatically computes the optimal number of iterations to run timeit for, i.e. more iterations for smaller datasets than for bigger ones. This should be more reliable. I also propose including bigger datasets, so that the timings go into milliseconds and seconds. With those couple of changes, the new benchmarking setup looks something like this (remember to copy and paste it into an IPython console to run it) -
import numpy as np
import pandas as pd

sizes = [10, 30, 100, 300, 1000, 3000, 10000, 100000, 1000000, 10000000]
timings = np.zeros((len(sizes), 2))
for i, s in enumerate(sizes):
    diffs = np.random.randint(100, size=s)
    d = pd.DataFrame(dict(A=np.arange(s).repeat(diffs)))
    res = %timeit -oq div3(d)
    timings[i, 0] = res.best
    res = %timeit -oq pir1_5(d)
    timings[i, 1] = res.best
timings_df = pd.DataFrame(timings, columns=['div3(sec)', 'pir1_5(sec)'])
timings_df.index = sizes
timings_df.index.name = 'Datasizes'
For completeness, the approaches were -
def pir1_5(d):
    v = d.A.values
    t = np.flatnonzero(v[1:] != v[:-1])
    s = np.empty(t.size + 2, int)
    s[0] = -1
    s[-1] = v.size - 1
    s[1:-1] = t
    r = np.diff(s)
    return pd.DataFrame(v[(r > 3).repeat(r)])
def div3(df, N=4):
    a = df.A.values
    mask = np.concatenate(( [True], a[1:] != a[:-1], [True] ))
    idx = np.flatnonzero(mask)
    count = idx[1:] - idx[:-1]
    valid_mask = count >= N
    good_idx = idx[:-1][valid_mask]
    out_arr = np.repeat(a[good_idx], count[valid_mask])
    return pd.DataFrame(out_arr)
The timing setup was run in an IPython console (as we are using magic functions). The results looked like this -
In [265]: timings_df
Out[265]:
div3(sec) pir1_5(sec)
Datasizes
10 0.000090 0.000089
30 0.000096 0.000097
100 0.000109 0.000118
300 0.000157 0.000182
1000 0.000303 0.000396
3000 0.000713 0.000998
10000 0.002252 0.003701
100000 0.023257 0.036480
1000000 0.258133 0.398812
10000000 2.603467 3.759063
Thus, the speedup figures for div3 over pir1_5 are:
In [266]: timings_df.iloc[:,1]/timings_df.iloc[:,0]
Out[266]:
Datasizes
10 0.997704
30 1.016446
100 1.077129
300 1.163333
1000 1.304689
3000 1.400464
10000 1.643474
100000 1.568554
1000000 1.544987
10000000 1.443868