0

I am very new to python and pandas and am working on a pandas data frame which is a 3 month data and looks like:

Date     ID   product
Jul-1     1     A
Jul-1     2     B
Jul-1     3     D
Jul-4     2     F
Jul-5     1     G
Jul-9     1     K
Jul-9     2     L
Jul-9     3     X
Jul-11    1     N
Jul-13    3     M
Jul-17    1     Z
Jul-17    2     O
Jul-17    3     T

What I am trying to do is to select the ID's only at the increment of 7 days, such that it does not select any ID if it is present for the next seven days from the last date it was picked. The resulting data frame would look like this

Date     ID   product
Jul-1     1     A
Jul-1     2     B
Jul-1     3     D
Jul-9     1     K
Jul-9     2     L
Jul-9     3     X
Jul-17    1     Z
Jul-17    2     O
Jul-17    3     T

Have been trying since morning and any help would be highly appreciated

Thanks a lot in advance!!

Analytics_TM
  • 493
  • 6
  • 28
  • Without the year in the date column, it is not possible. Will it always be the current year? – entropy Mar 07 '19 at 04:33
  • Thanks for your response! We have the data for last 3 months and thus the year would be the same.The actual format of the date is `mm/dd/yyyy`.The given date is just an example. – Analytics_TM Mar 07 '19 at 04:38
  • 1
    @AnalyticsTeam please do not post such examples, users would just waste their time in replicating the issue which is different than yours since you have a different format. Why not just post an example which matches your formats and dtypes – anky Mar 07 '19 at 05:15

2 Answers2

2

Use the data from @suicidalteddy, with minor edits

import numpy as np
import panas as pd

dat = pd.DataFrame([{'Date': '07/01/2016', 'ID': 1, 'product':'A'},
                {'Date': '07/01/2016', 'ID': 2, 'product':'B'},
                {'Date': '07/01/2016', 'ID': 3, 'product':'D'},
                {'Date': '07/04/2016', 'ID': 2, 'product':'F'},
                {'Date': '07/05/2016', 'ID': 1, 'product':'G'},
                {'Date': '07/09/2016', 'ID': 1, 'product':'K'},
                {'Date': '07/09/2016', 'ID': 2, 'product':'L'},
                {'Date': '07/09/2016', 'ID': 3, 'product':'X'},
                {'Date': '07/11/2016', 'ID': 1, 'product':'N'},
                {'Date': '07/13/2016', 'ID': 3, 'product':'M'},
                {'Date': '07/17/2016', 'ID': 1, 'product':'Z'},
                {'Date': '07/17/2016', 'ID': 2, 'product':'O'},
                {'Date': '07/17/2016', 'ID': 3, 'product':'T'},])
dat.Date = pd.to_datetime(dat.Date, format="%m/%d/%Y")

print(dat)
         Date  ID product
0  2016-07-01   1       A
1  2016-07-01   2       B
2  2016-07-01   3       D
3  2016-07-04   2       F
4  2016-07-05   1       G
5  2016-07-09   1       K
6  2016-07-09   2       L
7  2016-07-09   3       X
8  2016-07-11   1       N
9  2016-07-13   3       M
10 2016-07-17   1       Z
11 2016-07-17   2       O
12 2016-07-17   3       T

Now, use a helper function to

  1. calculate the time difference between Dates (per ID)
  2. calculate the week (0-7 days is week 1, 7-14 days is week 2, etc.)

Then, within each group (use groupby), use these 2 helper functions to calculate the time and week. Since you require only the first entry of each week (7-days), do a second groupby and retrieve the first record.

Here is the code to calculate the time and week as separate columns

  • code to calculate num_weeks is from this SO post
dat = dat.sort_values(by=['ID','Date'])
dat['time'] = pd.Series()
dat['week'] = pd.Series([1]*len(dat))

def calc_time(df):
    df.loc[:,'time'] = df['Date'] - df.loc[df.index.min(), 'Date']
    num_weeks = (
        int(np.ceil((dat['Date'].max() - \
            dat['Date'].min())/np.timedelta64(1, 'W')))
                )
    df = calc_week(df, num_weeks) # num_weeks = 3
    return df

def calc_week(df, num_weeks):
    for week in range(1, num_weeks+1):
        # print(str(week*7), str((week+1)*7))
        df.loc[\
            (df['time'] > str(week*7)+' days') & \
            (df['time'] <= str((week+1)*7)+' days'), \
        'week'] = week+1
    return df

dat = dat.groupby(['ID']).apply(calc_time)

Here is the output of the above step

print(dat)
         Date  ID product    time  week
0  2016-07-01   1       A  0 days     1
4  2016-07-05   1       G  4 days     1
5  2016-07-09   1       K  8 days     2
8  2016-07-11   1       N 10 days     2
10 2016-07-17   1       Z 16 days     3
1  2016-07-01   2       B  0 days     1
3  2016-07-04   2       F  3 days     1
6  2016-07-09   2       L  8 days     2
11 2016-07-17   2       O 16 days     3
2  2016-07-01   3       D  0 days     1
7  2016-07-09   3       X  8 days     2
9  2016-07-13   3       M 12 days     2
12 2016-07-17   3       T 16 days     3

Now, the second groupby and sort to get your final desired output

dat = dat.groupby(['ID','week']).first().reset_index(drop=False)
dff = (
    dat[['Date','ID','product']].sort_values(by=['Date','ID'])
                                .reset_index(drop=True)
    )

print(dff)
        Date  ID product
0 2016-07-01   1       A
1 2016-07-01   2       B
2 2016-07-01   3       D
3 2016-07-09   1       K
4 2016-07-09   2       L
5 2016-07-09   3       X
6 2016-07-17   1       Z
7 2016-07-17   2       O
8 2016-07-17   3       T

Original Attempt

Generate some data in same format as OP

idx = pd.date_range('2018-04-01', '2018-05-01', freq='1D') + \
      pd.DateOffset(days=16)
df = pd.DataFrame(idx, columns=['Date'])
df.set_index('Date', inplace=True)
df['ID'] = range(len(idx))
df['product'] = range(12,12+len(idx))

print(df)
            ID  product
Date                   
2018-04-17   0       12
2018-04-18   1       13
...
2018-05-16  29       41
2018-05-17  30       42

Use pd.data_range to pick out only the required intervals

idx = pd.date_range(df.index[0], df.index[-1], freq='7D')
df_seven_days = df.loc[idx].reset_index(drop=False)
df_seven_days.rename(columns={'index': 'Date'}, inplace=True)

print(df_seven_days)
      Date  ID  product
2018-04-17   0       12
2018-04-24   7       19
2018-05-01  14       26
2018-05-08  21       33
2018-05-15  28       40
edesz
  • 11,756
  • 22
  • 75
  • 123
  • Thanks for the response! We have tried this but unfortunately, not resulting in what we are after. So any user id once selected will only be selected again after 7 days of previous occurrence therefore, the selection criteria is dependent on both date and user ID. – Analytics_TM Mar 11 '19 at 23:20
  • If I understand this correctly, you just want to select rows in increments of 7 days. See my **EDIT** and let me know if this is what you want. – edesz Mar 12 '19 at 00:09
  • Thanks for this! But what we want is the id to be selected again only if it’s not present in the next 7 days i.e. date associated with that id + 7 days. Hope this makes it a bit clear Date ID product 2018-04-01 1 A #ID-1 gets selected 2018-04-01 2 A #ID-2 gets selected 2018-04-02 1 A # ID 1 doesn’t get selected 2018-04-01 < 7days 2018-04-10 1 A #ID-1 gets selected 2018-04-11 2 A #ID-2 gets selected – Analytics_TM Mar 12 '19 at 02:17
  • @AnalyticsTeam, please see the updated Answer. (note I have removed the **EDIT** since it was not useful) – edesz Mar 12 '19 at 04:30
0

Do this:

dat = pd.DataFrame([{'Date': '01/02/2016', 'ID': 1, 'product':'A'},
                {'Date': '01/02/2016', 'ID': 2, 'product':'B'},
                {'Date': '01/04/2016', 'ID': 4, 'product':'C'},
                {'Date': '01/05/2016', 'ID': 5, 'product':'D'},
                {'Date': '01/06/2016', 'ID': 6, 'product':'a'},
                {'Date': '01/07/2016', 'ID': 7, 'product':'b'},
                {'Date': '01/09/2016', 'ID': 8, 'product':'d'},
                {'Date': '01/09/2016', 'ID': 9, 'product':'A'},
                {'Date': '01/16/2016', 'ID': 10, 'product':'B'},
                {'Date': '01/18/2016', 'ID': 11, 'product':'C'},
                {'Date': '01/18/2016', 'ID': 12, 'product':'K'},
                {'Date': '01/21/2016', 'ID': 13, 'product':'J'},
                {'Date': '01/25/2016', 'ID': 14, 'product':'I'},
                {'Date': '01/25/2016', 'ID': 15, 'product':'H'},
                {'Date': '01/25/2016', 'ID': 16, 'product':'G'},
                {'Date': '01/30/2016', 'ID': 17, 'product':'F'}]) # Replace with your data
filtered_data = pd.DataFrame(columns=['Date', 'ID', 'product'])
dat.Date = pd.to_datetime(dat.Date, format="%m/%d/%Y") # Change the format as per your data
filtered_data.loc[len(filtered_data)] = dat.iloc[0]

def trans(row):
    if (row.Date - filtered_data.tail(1).iloc[0].Date).days >= 7 or (row.Date - filtered_data.tail(1).iloc[0].Date).days == 0:
        filtered_data.loc[len(filtered_data)] = row

dat.apply(trans, axis=1)
print filtered_data

Result:

        Date  ID product
0 2016-01-02   1       A
1 2016-01-02   1       A
2 2016-01-02   2       B
3 2016-01-09   8       d
4 2016-01-09   9       A
5 2016-01-16  10       B
6 2016-01-25  14       I
7 2016-01-25  15       H
8 2016-01-25  16       G
entropy
  • 840
  • 6
  • 16
  • Thanks @suicidalteddy! row 2 (02-01-2016) shouldn't be selected as the days between these 2 dates is less than 7 for same ID - 1.Please see our required dataset and you'll see ID only gets selected once when days between dates(associated with ID)>=7. Hope this makes sense. – Analytics_TM Mar 12 '19 at 03:19