First you need to make sure `SYSDATE` is converted to datetime. I'll do the same for `'EFFECTIVE DT'`.

```python
df[['EFFECTIVE DT', 'SYSDATE']] = \
    df[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)
```
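As a minimal sketch of that conversion on a toy frame (the string values here are invented for illustration), `apply(pd.to_datetime)` runs the parser column by column:

```python
import pandas as pd

toy = pd.DataFrame({
    'EFFECTIVE DT': ['2017-06-30', '2017-03-31'],
    'SYSDATE': ['2017-08-16 10:05:00', '2017-08-15 14:25:00'],
})

# apply hands each column to pd.to_datetime in turn
toy[['EFFECTIVE DT', 'SYSDATE']] = \
    toy[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)
```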
Option 1
`pir1`

Use `groupby.apply` with the dataframe method `pd.DataFrame.nlargest`, passing the parameters `columns='SYSDATE'` and `n=2` to get the rows with the two largest `'SYSDATE'` values.

```python
df.groupby(
    'EFFECTIVE DT', group_keys=False, sort=False
).apply(pd.DataFrame.nlargest, n=2, columns='SYSDATE')
```
```
    IND ID  value EFFECTIVE DT             SYSDATE
0     8  A  19289   2017-06-30 2017-08-16 10:05:00
1    17  A  19289   2017-06-30 2017-08-15 14:25:00
3     7  A  18155   2017-03-31 2017-08-16 10:05:00
4    16  A  18155   2017-03-31 2017-08-15 14:25:00
6     6  A  21770   2016-12-31 2017-08-16 10:05:00
7    15  A  21770   2016-12-31 2017-08-15 14:25:00
9     5  A  19226   2016-09-30 2017-08-16 10:05:00
10   14  A  19226   2016-09-30 2017-08-15 14:25:00
12    4  A  20238   2016-06-30 2017-08-16 10:05:00
13   13  A  20238   2016-06-30 2017-08-15 14:25:00
15    3  A  18684   2016-03-31 2017-08-16 10:05:00
16   12  A  18684   2016-03-31 2017-08-15 14:25:00
18    2  A  22059   2015-12-31 2017-08-16 10:05:00
19   11  A  22059   2015-12-31 2017-08-15 14:25:00
21    1  A  19280   2015-09-30 2017-08-16 10:05:00
22   10  A  19280   2015-09-30 2017-08-15 14:25:00
24    0  A  20813   2015-06-30 2017-08-16 10:05:00
25    9  A  20813   2015-06-30 2017-08-15 14:25:00
```
How It Works

To start, splitting the data, applying a function to each split, and recombining the results is a key feature of pandas, explained well in the split-apply-combine section of the documentation.

The `groupby` element should be self-evident: I want to group the data by each day, as defined by the dates in the `'EFFECTIVE DT'` column. After that, you can do many things with this `groupby` object. I decided to apply a function that returns the two rows corresponding to the two largest values of the `'SYSDATE'` column. Those largest values equate to the most recent timestamps for the group's day.

It turns out that there is a dataframe method that performs exactly this task of returning the rows corresponding to the largest values of a column: `pd.DataFrame.nlargest`.

Two things to note:

- When we use `groupby.apply`, the object passed to the applied function is a `pd.DataFrame` object.
- When using a method like `pd.DataFrame.nlargest` as a function, the first argument it expects is a `pd.DataFrame` object.

Well, that's fortunate, because that's exactly what I'm doing.

Also, `groupby.apply` allows you to pass additional keyword arguments to the applied function via `**kwargs`. So I can pass `n=2` and `columns='SYSDATE'` easily.
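To make those two points concrete, here's a minimal sketch on a toy frame (the `grp`/`val` column names are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'grp': ['a', 'a', 'a', 'b', 'b'],
    'val': [1, 3, 2, 5, 4],
})

# Each group arrives at pd.DataFrame.nlargest as a DataFrame (its
# expected first argument); n and columns ride along as kwargs.
out = toy.groupby('grp', group_keys=False, sort=False).apply(
    pd.DataFrame.nlargest, n=2, columns='val'
)
# out keeps the two largest 'val' rows of each group
```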
Option 2
`pir2`

Same concept as option 1, but using `np.argpartition` to pick out the two largest values per group. (The benchmark functions below operate on a test frame with `DAYS`/`HOURS` columns rather than `'EFFECTIVE DT'`/`'SYSDATE'`; see the Simulation section.)

```python
def nlrg(d):
    v = d.HOURS.values
    # indices of the two largest values (order within the pair unspecified)
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]

pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)
```
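To see what `np.argpartition` contributes here, a small sketch (toy array invented for illustration):

```python
import numpy as np

v = np.array([7, 1, 9, 3, 5])

# Partitioning at position v.size - 2 guarantees the last two slots
# hold the indices of the two largest values, in unspecified order,
# without fully sorting the array.
a = np.argpartition(v, v.size - 2)[-2:]
top_two = set(v[a])
```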
Option 3
`pir4`

Using `numba.njit`, I scan through the values once, tracking the two largest seen so far for each group.

```python
from numba import njit

@njit
def nlrg_nb(f, v, i, n):
    # b holds row positions, e holds the corresponding values:
    # one row per group, two slots per row
    b = (np.arange(n * 2) * 0).reshape(-1, 2)
    e = b * np.nan
    for x, y, z in zip(f, v, i):
        if np.isnan(e[x, 0]):
            e[x, 0] = y
            b[x, 0] = z
        elif y > e[x, 0]:
            e[x, :] = [y, e[x, 0]]
            b[x, :] = [z, b[x, 0]]
        elif np.isnan(e[x, 1]):
            e[x, 1] = y
            b[x, 1] = z
        elif y > e[x, 1]:
            e[x, 1] = y
            b[x, 1] = z
    return b.ravel()[~np.isnan(e.ravel())]

def pir4(d):
    f, u = pd.factorize(d.DAYS.values)
    return d.iloc[nlrg_nb(f, d.HOURS.values.astype(float), np.arange(f.size), u.size)]
```
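`pd.factorize` is what supplies the dense integer group ids that `nlrg_nb` indexes its arrays with; a minimal sketch:

```python
import pandas as pd

# labels: one integer per row, numbering groups in first-seen order
# uniques: the distinct values themselves
labels, uniques = pd.factorize(pd.Series(['x', 'y', 'x', 'z']))
```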
Timing

Results

Relative timings, each row divided by its fastest entry:

```python
(lambda r: r.div(r.min(1), 0))(results)
```

```
             pir1        pir2  pir4      jez1
100     24.205348    9.725718   1.0  4.449165
300     42.685989   15.754161   1.0  4.047182
1000   111.733703   39.822652   1.0  4.175235
3000   253.873888   74.280675   1.0  4.105493
10000  376.157526  125.323946   1.0  4.313063
30000  434.815009  145.513904   1.0  5.296250
```
Simulation

```python
from timeit import timeit

def produce_test_df(i):
    hours = pd.date_range('2000-01-01', periods=i, freq='H')[np.random.permutation(np.arange(i))]
    days = hours.floor('D')
    return pd.DataFrame(dict(HOURS=hours, DAYS=days))

results = pd.DataFrame(
    index=[100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir4 jez1'.split(),
    dtype=float,
)

for i in results.index:
    d = produce_test_df(i)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setp, number=20)

results.plot(loglog=True)
```
Functions

```python
def nlrg(d):
    v = d.HOURS.values
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]

pir1 = lambda d: d.groupby('DAYS', group_keys=False, sort=False).apply(pd.DataFrame.nlargest, n=2, columns='HOURS')
pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)
jez1 = lambda d: d.sort_values(['DAYS', 'HOURS']).groupby('DAYS').tail(2)

@njit
def nlrg_nb(f, v, i, n):
    b = (np.arange(n * 2) * 0).reshape(-1, 2)
    e = b * np.nan
    for x, y, z in zip(f, v, i):
        if np.isnan(e[x, 0]):
            e[x, 0] = y
            b[x, 0] = z
        elif y > e[x, 0]:
            e[x, :] = [y, e[x, 0]]
            b[x, :] = [z, b[x, 0]]
        elif np.isnan(e[x, 1]):
            e[x, 1] = y
            b[x, 1] = z
        elif y > e[x, 1]:
            e[x, 1] = y
            b[x, 1] = z
    return b.ravel()[~np.isnan(e.ravel())]

def pir4(d):
    f, u = pd.factorize(d.DAYS.values)
    return d.iloc[nlrg_nb(f, d.HOURS.values.astype(float), np.arange(f.size), u.size)]
```