Pandas Select all Rows with Either of the Top 2 Values in a Particular Column

Question

I have the following sample data:

IND ID  value  EFFECTIVE DT    SYSDATE
8   A   19289   6/30/2017   8/16/2017 10:05
17  A   19289   6/30/2017   8/15/2017 14:25
26  A   19289   6/30/2017   8/14/2017 15:10
7   A   18155   3/31/2017   8/16/2017 10:05
16  A   18155   3/31/2017   8/15/2017 14:25
25  A   18155   3/31/2017   8/14/2017 15:10
6   A   21770   12/31/2016  8/16/2017 10:05
15  A   21770   12/31/2016  8/15/2017 14:25
24  A   21770   12/31/2016  8/14/2017 15:10
5   A   19226   9/30/2016   8/16/2017 10:05
14  A   19226   9/30/2016   8/15/2017 14:25
23  A   19226   9/30/2016   8/14/2017 15:10
4   A   20238   6/30/2016   8/16/2017 10:05
13  A   20238   6/30/2016   8/15/2017 14:25
22  A   20238   6/30/2016   8/14/2017 15:10
3   A   18684   3/31/2016   8/16/2017 10:05
12  A   18684   3/31/2016   8/15/2017 14:25
21  A   18684   3/31/2016   8/14/2017 15:10
2   A   22059   12/31/2015  8/16/2017 10:05
11  A   22059   12/31/2015  8/15/2017 14:25
20  A   22059   12/31/2015  8/14/2017 15:10
1   A   19280   9/30/2015   8/16/2017 10:05
10  A   19280   9/30/2015   8/15/2017 14:25
19  A   19280   9/30/2015   8/14/2017 15:10
0   A   20813   6/30/2015   8/16/2017 10:05
9   A   20813   6/30/2015   8/15/2017 14:25
18  A   20813   6/30/2015   8/14/2017 15:10

It is a set of data I collect every weekday (SYSDATE is a timestamp).

I want to generate a df with only the rows stamped with either of the two most recent timestamp values every day.

So I'm looking to get this (from much larger set with many timestamps) if I were to run the script today:

IND ID  Value   EFFECTIVE DT    SYSDATE
8   A   19289   6/30/2017   8/16/2017 10:05
17  A   19289   6/30/2017   8/15/2017 14:25
7   A   18155   3/31/2017   8/16/2017 10:05
16  A   18155   3/31/2017   8/15/2017 14:25
6   A   21770   12/31/2016  8/16/2017 10:05
15  A   21770   12/31/2016  8/15/2017 14:25
5   A   19226   9/30/2016   8/16/2017 10:05
14  A   19226   9/30/2016   8/15/2017 14:25
4   A   20238   6/30/2016   8/16/2017 10:05
13  A   20238   6/30/2016   8/15/2017 14:25
3   A   18684   3/31/2016   8/16/2017 10:05
12  A   18684   3/31/2016   8/15/2017 14:25
2   A   22059   12/31/2015  8/16/2017 10:05
11  A   22059   12/31/2015  8/15/2017 14:25
1   A   19280   9/30/2015   8/16/2017 10:05
10  A   19280   9/30/2015   8/15/2017 14:25
0   A   20813   6/30/2015   8/16/2017 10:05
9   A   20813   6/30/2015   8/15/2017 14:25

I can't use datetimes because of weekends and holidays.

Suggestions?

Thanks in advance.

hope the link can help https://stackoverflow.com/questions/32530856/how-to-keep-first-two-duplicates-in-a-pandas-dataframe — BENY, Aug 16 '17 at 19:57
@jezrael Not huge but I'm always in favor of doing things efficiently. If you can share that would be great. — Windstorm1981, Aug 16 '17 at 20:09

piRSquared · Accepted Answer · 2017-08-17T06:41:09.963

You need to first make sure SYSDATE is turned into datetime. I'll do it for EFFECTIVE DT as well.

df[['EFFECTIVE DT', 'SYSDATE']] = \
    df[['EFFECTIVE DT', 'SYSDATE']].apply(pd.to_datetime)

Option 1
pir1
Use groupby.apply with the dataframe method pd.DataFrame.nlargest where you pass the parameters columns='SYSDATE' and n=2 for the largest two 'SYSDATE's.

df.groupby(
    'EFFECTIVE DT', group_keys=False, sort=False
).apply(pd.DataFrame.nlargest, n=2, columns='SYSDATE')

    IND ID  value EFFECTIVE DT             SYSDATE
0     8  A  19289   2017-06-30 2017-08-16 10:05:00
1    17  A  19289   2017-06-30 2017-08-15 14:25:00
3     7  A  18155   2017-03-31 2017-08-16 10:05:00
4    16  A  18155   2017-03-31 2017-08-15 14:25:00
6     6  A  21770   2016-12-31 2017-08-16 10:05:00
7    15  A  21770   2016-12-31 2017-08-15 14:25:00
9     5  A  19226   2016-09-30 2017-08-16 10:05:00
10   14  A  19226   2016-09-30 2017-08-15 14:25:00
12    4  A  20238   2016-06-30 2017-08-16 10:05:00
13   13  A  20238   2016-06-30 2017-08-15 14:25:00
15    3  A  18684   2016-03-31 2017-08-16 10:05:00
16   12  A  18684   2016-03-31 2017-08-15 14:25:00
18    2  A  22059   2015-12-31 2017-08-16 10:05:00
19   11  A  22059   2015-12-31 2017-08-15 14:25:00
21    1  A  19280   2015-09-30 2017-08-16 10:05:00
22   10  A  19280   2015-09-30 2017-08-15 14:25:00
24    0  A  20813   2015-06-30 2017-08-16 10:05:00
25    9  A  20813   2015-06-30 2017-08-15 14:25:00

How It Works
To start, splitting, applying stuff to splits, and recombining your efforts is a key feature of pandas and is explained well here split-apply-combine.

The groupby element should be self-evident. I want to group the data by each day as defined by the dates in the 'EFFECTIVE DT' column. After that, you can do many things with this groupby object. I decided to apply a function that will return the 2 rows that correspond to the largest two values of the 'SYSDATE' column. Those largest values equate to the most recent for the day of the group.

It turns out that there is a dataframe method that performs this task of returning the rows corresponding to the largest values of a column. Namely, pd.DataFrame.nlargest.

Two things to note:

When we use groupby.apply, the object being passed to the function being applied is a pd.DataFrame object.
When using a method like pd.DataFrame.nlargest as a function, the first argument that is expected is a pd.DataFrame object.

Well, that's fortunate, because that's exactly what I'm doing.

Also, groupby.apply allows you to pass additional key word arguments to the applied function via kwargs. So, I can pass n=2 and columns='SYSDATE' easily.

Option 2
pir2
Same concept as option 1 but using np.argpartion

def nlrg(d):
    v = d.HOURS.values
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]

pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)

Option 3
pir4
Using numba.njit
I scan through the list tracking last two maximum values.

form numba import njit

@njit
def nlrg_nb(f, v, i, n):
    b = (np.arange(n * 2) * 0).reshape(-1, 2)
    e = b * np.nan
    for x, y, z in zip(f, v, i):
        if np.isnan(e[x, 0]):
            e[x, 0] = y
            b[x, 0] = z
        elif y > e[x, 0]:
            e[x, :] = [y, e[x, 0]]
            b[x, :] = [z, b[x, 0]]
        elif np.isnan(e[x, 1]):
            e[x, 1] = y
            b[x, 1] = z
        elif y > e[x, 1]:
            e[x, 1] = y
            b[x, 1] = z
    return b.ravel()[~np.isnan(e.ravel())]

def pir4(d):
    f, u = pd.factorize(d.DAYS.values)
    return d.iloc[nlrg_nb(f, d.HOURS.values.astype(float), np.arange(f.size), u.size)]

Timing

Results

(lambda r: r.div(r.min(1), 0))(results)

             pir1        pir2  pir4      jez1
100     24.205348    9.725718   1.0  4.449165
300     42.685989   15.754161   1.0  4.047182
1000   111.733703   39.822652   1.0  4.175235
3000   253.873888   74.280675   1.0  4.105493
10000  376.157526  125.323946   1.0  4.313063
30000  434.815009  145.513904   1.0  5.296250

Simulation

def produce_test_df(i):
    hours = pd.date_range('2000-01-01', periods=i, freq='H')[np.random.permutation(np.arange(i))]
    days = hours.floor('D')
    return pd.DataFrame(dict(HOURS=hours, DAYS=days))

results = pd.DataFrame(
    index=[100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir4 jez1'.split(),
    dtype=float,
)

for i in results.index:
    d = produce_test_df(i)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setp, number=20))

results.plot(loglog=True)

Functions

def nlrg(d):
    v = d.HOURS.values
    a = np.argpartition(v, v.size - 2)[-2:]
    return d.iloc[a]

pir1 = lambda d: d.groupby('DAYS', group_keys=False, sort=False).apply(pd.DataFrame.nlargest, n=2, columns='HOURS')
pir2 = lambda d: d.groupby('DAYS', sort=False, group_keys=False).apply(nlrg)
jez1 = lambda d: d.sort_values(['DAYS', 'HOURS']).groupby('DAYS').tail(2)

@njit
def nlrg_nb(f, v, i, n):
    b = (np.arange(n * 2) * 0).reshape(-1, 2)
    e = b * np.nan
    for x, y, z in zip(f, v, i):
        if np.isnan(e[x, 0]):
            e[x, 0] = y
            b[x, 0] = z
        elif y > e[x, 0]:
            e[x, :] = [y, e[x, 0]]
            b[x, :] = [z, b[x, 0]]
        elif np.isnan(e[x, 1]):
            e[x, 1] = y
            b[x, 1] = z
        elif y > e[x, 1]:
            e[x, 1] = y
            b[x, 1] = z
    return b.ravel()[~np.isnan(e.ravel())]

def pir4(d):
    f, u = pd.factorize(d.DAYS.values)
    return d.iloc[nlrg_nb(f, d.HOURS.values.astype(float), np.arange(f.size), u.size)]

jezrael · Answer 2 · 2017-08-16T20:36:33.683

I think you need sort_values with groupby and return GroupBy.head:

#convert to datetime
df['SYSDATE'] = pd.to_datetime(df['SYSDATE'])
df['EFFECTIVE DT'] = pd.to_datetime(df['EFFECTIVE DT'])

df = df.sort_values(['EFFECTIVE DT','SYSDATE'], ascending=[True,False])
df = df.groupby('EFFECTIVE DT').head(2).sort_index()
print (df)
    IND ID  value EFFECTIVE DT             SYSDATE
0     8  A  19289   2017-06-30 2017-08-16 10:05:00
1    17  A  19289   2017-06-30 2017-08-15 14:25:00
3     7  A  18155   2017-03-31 2017-08-16 10:05:00
4    16  A  18155   2017-03-31 2017-08-15 14:25:00
6     6  A  21770   2016-12-31 2017-08-16 10:05:00
7    15  A  21770   2016-12-31 2017-08-15 14:25:00
9     5  A  19226   2016-09-30 2017-08-16 10:05:00
10   14  A  19226   2016-09-30 2017-08-15 14:25:00
12    4  A  20238   2016-06-30 2017-08-16 10:05:00
13   13  A  20238   2016-06-30 2017-08-15 14:25:00
15    3  A  18684   2016-03-31 2017-08-16 10:05:00
16   12  A  18684   2016-03-31 2017-08-15 14:25:00
18    2  A  22059   2015-12-31 2017-08-16 10:05:00
19   11  A  22059   2015-12-31 2017-08-15 14:25:00
21    1  A  19280   2015-09-30 2017-08-16 10:05:00
22   10  A  19280   2015-09-30 2017-08-15 14:25:00
24    0  A  20813   2015-06-30 2017-08-16 10:05:00
25    9  A  20813   2015-06-30 2017-08-15 14:25:00

Another similar solution, thank you piRSquared :

df = df.sort_values(['EFFECTIVE DT','SYSDATE']) \
       .groupby('EFFECTIVE DT') \
       .tail(2) \
       .sort_index()
print (df)
    IND ID  value EFFECTIVE DT             SYSDATE
0     8  A  19289   2017-06-30 2017-08-16 10:05:00
1    17  A  19289   2017-06-30 2017-08-15 14:25:00
3     7  A  18155   2017-03-31 2017-08-16 10:05:00
4    16  A  18155   2017-03-31 2017-08-15 14:25:00
6     6  A  21770   2016-12-31 2017-08-16 10:05:00
7    15  A  21770   2016-12-31 2017-08-15 14:25:00
9     5  A  19226   2016-09-30 2017-08-16 10:05:00
10   14  A  19226   2016-09-30 2017-08-15 14:25:00
12    4  A  20238   2016-06-30 2017-08-16 10:05:00
13   13  A  20238   2016-06-30 2017-08-15 14:25:00
15    3  A  18684   2016-03-31 2017-08-16 10:05:00
16   12  A  18684   2016-03-31 2017-08-15 14:25:00
18    2  A  22059   2015-12-31 2017-08-16 10:05:00
19   11  A  22059   2015-12-31 2017-08-15 14:25:00
21    1  A  19280   2015-09-30 2017-08-16 10:05:00
22   10  A  19280   2015-09-30 2017-08-15 14:25:00
24    0  A  20813   2015-06-30 2017-08-16 10:05:00
25    9  A  20813   2015-06-30 2017-08-15 14:25:00

Timings:

np.random.seed(345)

N = 100000
dates1 = [pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'), pd.Timestamp('2017-08-14 15:10:00'), pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'), pd.Timestamp('2017-08-14 15:10:00'), pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'), pd.Timestamp('2017-08-14 15:10:00'), pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'), pd.Timestamp('2017-08-14 15:10:00'), pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'), pd.Timestamp('2017-08-14 15:10:00'), pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'), pd.Timestamp('2017-08-14 15:10:00'), pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'), pd.Timestamp('2017-08-14 15:10:00'), pd.Timestamp('2017-08-16 10:05:00'), pd.Timestamp('2017-08-15 14:25:00'),pd.Timestamp('2017-08-14 15:10:00')]
dates2 = pd.date_range('2015-01-01', periods=2000)

df = pd.DataFrame({'EFFECTIVE DT':np.random.choice(dates2, N),
                   'SYSDATE':np.random.choice(dates1, N)})

In [104]: %timeit df.groupby('EFFECTIVE DT', group_keys=False, sort=False).apply(pd.DataFrame.nlargest, n=2, columns='SYSDATE')
1 loop, best of 3: 2.61 s per loop

In [105]: %timeit df.sort_values(['EFFECTIVE DT','SYSDATE'], ascending=[True,False]).groupby('EFFECTIVE DT').head(2).sort_index()
10 loops, best of 3: 25.4 ms per loop

In [106]: %timeit df.sort_values(['EFFECTIVE DT','SYSDATE']).groupby('EFFECTIVE DT').tail(2).sort_index()
10 loops, best of 3: 23.3 ms per loop

I think you can forgo the `ascending` parameter if you use `tail` — piRSquared, Aug 16 '17 at 19:50
`df.sort_values(['EFFECTIVE DT','SYSDATE']).groupby('EFFECTIVE DT').tail(2).sort_index()` is what I used — piRSquared, Aug 16 '17 at 19:53

Pandas Select all Rows with Either of the Top 2 Values in a Particular Column

2 Answers2