
My question is based on this thread, where we group values of a pandas dataframe and select the latest (by date) from each group:

    id     product   date
0   220    6647     2014-09-01 
1   220    6647     2014-09-03 
2   220    6647     2014-10-16
3   826    3380     2014-11-11
4   826    3380     2014-12-09
5   826    3380     2015-05-19
6   901    4555     2014-09-01
7   901    4555     2014-10-05
8   901    4555     2014-11-01

using the following

df.loc[df.groupby('id').date.idxmax()]
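For reference, a minimal reproducible sketch of this baseline, with the sample frame above reconstructed and the date column explicitly parsed (nothing here beyond the original one-liner):

import pandas as pd

df = pd.DataFrame({
    'id':      [220, 220, 220, 826, 826, 826, 901, 901, 901],
    'product': [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    'date':    ['2014-09-01', '2014-09-03', '2014-10-16',
                '2014-11-11', '2014-12-09', '2015-05-19',
                '2014-09-01', '2014-10-05', '2014-11-01'],
})
df['date'] = pd.to_datetime(df['date'])  # parse so idxmax compares real timestamps

# latest date per id
print(df.loc[df.groupby('id')['date'].idxmax()])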

Say, however, that I want to add the condition that the latest (by date) should only be selected among records that lie within +/- 5 days of each other. I.e., after grouping by `id` I want to find the latest within the following sub-groups:

0   220    6647     2014-09-01 #because only these two are within +/- 5 days of each other
1   220    6647     2014-09-03 

2   220    6647     2014-10-16 #spaced more than 5 days apart from the two records above

3   826    3380     2014-11-11

.....

which yields

    id  product       date
1  220     6647 2014-09-03 
2  220     6647 2014-10-16
3  826     3380 2014-11-11
4  826     3380 2014-12-09
5  826     3380 2015-05-19
6  901     4555 2014-09-01
7  901     4555 2014-10-05
8  901     4555 2014-11-01

Updated dataset with a `price` column (within each +/- 5-day group, pick the record with the largest price):

    id     product   date           price
0   220    6647     2014-09-01      100   #group 1
1   220    6647     2014-09-03      120   #group 1   --> pick this
2   220    6647     2014-09-05      0     #group 1
3   826    3380     2014-11-11      150   #group 2   --> pick this
4   826    3380     2014-12-09      23    #group 3   --> pick this
5   826    3380     2015-05-12      88    #group 4   --> pick this
6   901    4555     2015-05-15      32    #group 4   
7   901    4555     2015-10-05      542   #group 5   --> pick this
8   901    4555     2015-11-01      98    #group 6   --> pick this
N08

2 Answers


I think you need to create the groups with `apply` and a list comprehension using `between`, then convert them to numeric group labels with `factorize`, and last use your solution with `loc` + `idxmax`:

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df = df.reset_index(drop=True)
td = pd.Timedelta('5 days')

def f(x):
    # for each date, collect the indices of all rows (within this id) that fall within +/- 5 days
    x['g'] = [tuple(x.index[x['date'].between(i - td, i + td)]) for i in x['date']]
    return x

df2 = df.groupby('id').apply(f)
# convert the tuples of indices into consecutive integer group labels
df2['g'] = pd.factorize(df2['g'])[0]
print (df2)
    id  product       date  price  g
0  220     6647 2014-09-01    100  0
1  220     6647 2014-09-03    120  0
2  220     6647 2014-09-05      0  0
3  826     3380 2014-11-11    150  1
4  826     3380 2014-12-09     23  2
5  826     3380 2015-05-12     88  3
6  901     4555 2015-05-15     32  4
7  901     4555 2015-10-05    542  5
8  901     4555 2015-11-01     98  6

# within each 5-day group, keep the row with the highest price
df3 = df2.loc[df2.groupby('g')['price'].idxmax()]
print (df3)
    id  product       date  price  g
1  220     6647 2014-09-03    120  0
3  826     3380 2014-11-11    150  1
4  826     3380 2014-12-09     23  2
5  826     3380 2015-05-12     88  3
6  901     4555 2015-05-15     32  4
7  901     4555 2015-10-05    542  5
8  901     4555 2015-11-01     98  6
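If the helper column `g` is not wanted in the final result, it can simply be dropped afterwards (a trivial follow-up, not part of the answer as posted):

df3 = df3.drop(columns='g')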
jezrael
  • This is a good solution, thanks. Say that there is a fourth column, price. Is it possible to generalize the solution such that within a given `group` of 5-day records, we pick the product that has the largest price? – N08 Dec 14 '18 at 10:13
  • @N08 - Because filtering is necessary again, I think you can add your solution `df1.loc[df1.groupby('id').date.idxmax()]` – jezrael Dec 14 '18 at 10:19
  • But don't we still need to filter based on `s`? – N08 Dec 14 '18 at 10:22
  • Yes, first filter by my solution for the `>5 days` groups and then filter by `max`, or use the added solution with `apply` – jezrael Dec 14 '18 at 10:31
  • @N08 - but the solution with `apply` is slow on larger DataFrames – jezrael Dec 14 '18 at 10:31
  • I updated the OP with an example of what should happen in this new case. The comments in the example show how the groups look and what record within each group must be chosen. I can't seem to make it work as desired when filtering by `s` first and then `max()` – N08 Dec 14 '18 at 11:58
  • Thanks. I have to find a way to optimize this, it is slow for large dataframes... – N08 Dec 14 '18 at 16:02
  • Yes, agree. Unfortunately it is really not easy to vectorize, because it is always necessary to compare each value of a group against all values of the group. – jezrael Dec 14 '18 at 16:16
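Regarding the performance concern in the last comments, a possible vectorized sketch (not from the original answers, and using a slightly different grouping rule: a new group starts whenever the gap to the previous record of the same `id` exceeds 5 days, rather than the full +/- 5 day window used by `between`):

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id', 'date'])

# gap to the previous row within the same id (NaT for the first row of each id)
gap = df.groupby('id')['date'].diff()

# start a new group when the id changes or the gap exceeds 5 days
df['g'] = (gap.isna() | (gap > pd.Timedelta('5 days'))).cumsum()

# pick the row with the highest price in each group
print(df.loc[df.groupby('g')['price'].idxmax()])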

Or use a two-liner:

# diff(-1): (negative) gap to the next row within each id; the string slice keeps only the day count
df2 = pd.to_numeric(df.groupby('id')['date'].diff(-1).astype(str).str[:-25]).abs().fillna(6)
# keep rows whose gap to the next same-id row is more than 5 days (last rows of each id get 6, so they are kept)
print(df.loc[df2.index[df2 > 5].tolist()])

Output:

    id  product       date
1  220     6647 2014-09-03
2  220     6647 2014-10-16
3  826     3380 2014-11-11
4  826     3380 2014-12-09
5  826     3380 2015-05-19
6  901     4555 2014-09-01
7  901     4555 2014-10-05
8  901     4555 2014-11-01

So: use `diff`, slice the day count out of the timedelta with a string slice, take the absolute value, keep only the rows whose gap is greater than 5 days, and use those indexes to select the rows in `df`. A variant without the string slicing is sketched below.
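The same idea without depending on the timedelta string format, using the `.dt.days` accessor instead of the string slice (not part of the answer as posted; assumes whole-day timestamps):

# days to the next row within each id; NaN (last row of each id) filled with 6 so it is kept
df2 = df.groupby('id')['date'].diff(-1).dt.days.abs().fillna(6)
print(df.loc[df2.index[df2 > 5].tolist()])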

U13-Forward