Use the data from @suicidalteddy, with minor edits
import numpy as np
import panas as pd
dat = pd.DataFrame([{'Date': '07/01/2016', 'ID': 1, 'product':'A'},
{'Date': '07/01/2016', 'ID': 2, 'product':'B'},
{'Date': '07/01/2016', 'ID': 3, 'product':'D'},
{'Date': '07/04/2016', 'ID': 2, 'product':'F'},
{'Date': '07/05/2016', 'ID': 1, 'product':'G'},
{'Date': '07/09/2016', 'ID': 1, 'product':'K'},
{'Date': '07/09/2016', 'ID': 2, 'product':'L'},
{'Date': '07/09/2016', 'ID': 3, 'product':'X'},
{'Date': '07/11/2016', 'ID': 1, 'product':'N'},
{'Date': '07/13/2016', 'ID': 3, 'product':'M'},
{'Date': '07/17/2016', 'ID': 1, 'product':'Z'},
{'Date': '07/17/2016', 'ID': 2, 'product':'O'},
{'Date': '07/17/2016', 'ID': 3, 'product':'T'},])
dat.Date = pd.to_datetime(dat.Date, format="%m/%d/%Y")
print(dat)
Date ID product
0 2016-07-01 1 A
1 2016-07-01 2 B
2 2016-07-01 3 D
3 2016-07-04 2 F
4 2016-07-05 1 G
5 2016-07-09 1 K
6 2016-07-09 2 L
7 2016-07-09 3 X
8 2016-07-11 1 N
9 2016-07-13 3 M
10 2016-07-17 1 Z
11 2016-07-17 2 O
12 2016-07-17 3 T
Now, use a helper function to
- calculate the time difference between
Date
s (per ID
)
- calculate the week (0-7 days is week 1, 7-14 days is week 2, etc.)
Then, within each group (use groupby
), use these 2 helper functions to calculate the time and week. Since you require only the first entry of each week (7-days), do a second groupby
and retrieve the first record.
Here is the code to calculate the time and week as separate columns
- code to calculate
num_weeks
is from this SO post
dat = dat.sort_values(by=['ID','Date'])
dat['time'] = pd.Series()
dat['week'] = pd.Series([1]*len(dat))
def calc_time(df):
df.loc[:,'time'] = df['Date'] - df.loc[df.index.min(), 'Date']
num_weeks = (
int(np.ceil((dat['Date'].max() - \
dat['Date'].min())/np.timedelta64(1, 'W')))
)
df = calc_week(df, num_weeks) # num_weeks = 3
return df
def calc_week(df, num_weeks):
for week in range(1, num_weeks+1):
# print(str(week*7), str((week+1)*7))
df.loc[\
(df['time'] > str(week*7)+' days') & \
(df['time'] <= str((week+1)*7)+' days'), \
'week'] = week+1
return df
dat = dat.groupby(['ID']).apply(calc_time)
Here is the output of the above step
print(dat)
Date ID product time week
0 2016-07-01 1 A 0 days 1
4 2016-07-05 1 G 4 days 1
5 2016-07-09 1 K 8 days 2
8 2016-07-11 1 N 10 days 2
10 2016-07-17 1 Z 16 days 3
1 2016-07-01 2 B 0 days 1
3 2016-07-04 2 F 3 days 1
6 2016-07-09 2 L 8 days 2
11 2016-07-17 2 O 16 days 3
2 2016-07-01 3 D 0 days 1
7 2016-07-09 3 X 8 days 2
9 2016-07-13 3 M 12 days 2
12 2016-07-17 3 T 16 days 3
Now, the second groupby
and sort to get your final desired output
dat = dat.groupby(['ID','week']).first().reset_index(drop=False)
dff = (
dat[['Date','ID','product']].sort_values(by=['Date','ID'])
.reset_index(drop=True)
)
print(dff)
Date ID product
0 2016-07-01 1 A
1 2016-07-01 2 B
2 2016-07-01 3 D
3 2016-07-09 1 K
4 2016-07-09 2 L
5 2016-07-09 3 X
6 2016-07-17 1 Z
7 2016-07-17 2 O
8 2016-07-17 3 T
Original Attempt
Generate some data in same format as OP
idx = pd.date_range('2018-04-01', '2018-05-01', freq='1D') + \
pd.DateOffset(days=16)
df = pd.DataFrame(idx, columns=['Date'])
df.set_index('Date', inplace=True)
df['ID'] = range(len(idx))
df['product'] = range(12,12+len(idx))
print(df)
ID product
Date
2018-04-17 0 12
2018-04-18 1 13
...
2018-05-16 29 41
2018-05-17 30 42
Use pd.data_range
to pick out only the required intervals
idx = pd.date_range(df.index[0], df.index[-1], freq='7D')
df_seven_days = df.loc[idx].reset_index(drop=False)
df_seven_days.rename(columns={'index': 'Date'}, inplace=True)
print(df_seven_days)
Date ID product
2018-04-17 0 12
2018-04-24 7 19
2018-05-01 14 26
2018-05-08 21 33
2018-05-15 28 40