
This is an example of the data frame I'm working with:

import pandas as pd

d = {
'item_number':['bdsm1000', 'bdsm1000', 'bdsm1000', 'ZZRWB18','ZZRWB18', 'ZZRWB18', 'ZZRWB18', 'ZZHP1427BLK', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1414', 'ZZHP1414', 'ZZHP1414', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE'],
'Comp_ID':[2454, 2454, 2454, 1395, 1395, 1395, 1395, 3378, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 43978, 43978, 43978, 1197347907, 70745, 4737, 1197347907, 4737, 1197347907, 70745, 4737, 1197347907, 70745, 4737, 1197347907, 4737, 1197487704, 1197347907, 70745, 23872, 4737, 1197347907, 4737, 1197487704, 1197347907, 23872, 4737, 1197487704, 1197347907, 70745],
'date':['2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19']}

df = pd.DataFrame(data=d)
df.date = pd.to_datetime(df.date)

I'd like to count consecutive observations, grouped by Comp_ID and item_number, counting back from 2016-11-22.

Essentially, what I am looking to do is count how many days in a row there is an observation, counting back from today's date, for each Comp_ID and item_number (this example was put together on the 22nd of Nov). Consecutive observations from days or weeks prior to today are not relevant; only sequences like today, yesterday, the day before yesterday, and so on count. For example, a group observed on the 22nd, 21st, 20th, 14th and 13th should get a count of 3, because the run ending today is three days long.

I got this to work on a smaller sample, but it seems to be getting tripped up on a larger data set.

Here is the code for the smaller sample. I need to find the consecutive dates with observations across thousands of sellers/items. For some reason, the code below did not work on the larger data set.

d = {'item_number':['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'],
'Comp_ID':['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'],
'date':['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']}

df = pd.DataFrame(data=d)
df.date = pd.to_datetime(df.date)
d = pd.Timedelta(1, 'D')

# sort each (Comp_ID, item_number) group's dates from newest to oldest
df = df.sort_values(['item_number', 'date', 'Comp_ID'], ascending=False)

g = df.groupby(['Comp_ID', 'item_number'])
# flag rows whose date is within one day of the neighbouring observation in the group
sequence = g['date'].apply(lambda x: x.diff().fillna(0).abs().le(d)).reset_index()
sequence.set_index('index', inplace=True)
test = df.join(sequence)
test.columns = ['Comp_ID', 'date', 'item_number', 'consecutive']
g = test.groupby(['Comp_ID', 'item_number'])
# run length: index of the first False flag minus index of the first True flag
g['consecutive'].apply(lambda x: x.idxmin() - x.idxmax())

This gets the desired result for the smaller data-set:

Comp_ID  item_number
1395     G123H          2
         KIN005         3
7787     A789B          2
Name: consecutive, dtype: int64

2 Answers


You can do it this way:

today = pd.to_datetime('2016-11-22')

# sort DF by `date` (descending)    
x = df.sort_values('date', ascending=0)
g = x.groupby(['Comp_ID','item_number'])
# compare the # of days to `today` with a consecutive day# in each group
x[(today - x['date']).dt.days == g.cumcount()].groupby(['Comp_ID','item_number']).size()

Result:

Comp_ID  item_number
1395     G123H          2
         KIN005         3
7787     A789B          2
dtype: int64

PS: thanks to @DataSwede for the faster diff calculation!

Explanation:

In [124]: x[(today - x['date']).dt.days == g.cumcount()] \
           .sort_values(['Comp_ID','item_number','date'], ascending=[1,1,0])
Out[124]:
  Comp_ID       date item_number
8    1395 2016-11-22       G123H
9    1395 2016-11-21       G123H
0    1395 2016-11-22      KIN005
1    1395 2016-11-21      KIN005
2    1395 2016-11-20      KIN005
5    7787 2016-11-22       A789B
6    7787 2016-11-21       A789B
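
To make the filtering step concrete, it can help to materialize both sides of the comparison as columns (a small sketch; the names `days_back` and `rank_in_group` are illustrative, not part of the original answer):

today = pd.to_datetime('2016-11-22')

x = df.sort_values('date', ascending=False)
g = x.groupby(['Comp_ID', 'item_number'])

# days_back: how many whole days each observation lies before `today`
# rank_in_group: 0, 1, 2, ... within each group, newest date first
debug = x.assign(days_back=(today - x['date']).dt.days,
                 rank_in_group=g.cumcount())

A row passes the filter exactly when `days_back == rank_in_group`. Assuming each (Comp_ID, item_number, date) appears at most once, `days_back` grows by at least one per row while `rank_in_group` grows by exactly one, so the first missing day pushes `days_back` permanently ahead and every later row in the group fails the equality.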
    At the calculate date diff line, is there a reason you chose to use the apply method over the groupby? I'm getting a slight performance gain, and the same output by calculating the diff using the dataframe-- `x['diff'] = (today - x['date']).dt.days`, but wanted to see if there's a reason why using apply is a better option. – DataSwede Nov 27 '16 at 20:27
  • @DataSwede, yeah, good catch, thank you! In the case when we are calculating the difference between a constant date (`today`) and the series, it works properly. At first I tried to calculate `diff` within each group and stuck with that slow variant... – MaxU - stand with Ukraine Nov 27 '16 at 20:30
  • @MaxU I ran into some errors with the solution that I had been using and switched over to this one and it's working perfectly! I'm kind of a beginner at this, so if you could add a bit more of a step-by-step explanation of what's happening when you're comparing the number of days to that groupby object, it'd be really appreciated! I would love to actually learn what's happening. – Yale Newman Jan 11 '17 at 21:18
  • @YaleNewman, I think the best way to understand it is to run all the small pieces and see the output at each step... – MaxU - stand with Ukraine Jan 11 '17 at 21:26

First I will propose that we yield a series of dates, each 1 day less than the prior...

import datetime
import pandas as pd

def gen_prior_date(start_date):
    yield start_date
    while True:
        start_date -= datetime.timedelta(days=1)
        yield start_date

...

>>> start_date = datetime.date(2016, 11, 22)
>>> back_in_time = gen_prior_date(start_date)
>>> next(back_in_time)
datetime.date(2016, 11, 22)
>>> next(back_in_time)
datetime.date(2016, 11, 21)
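
(As an aside, the same stream could be written with the standard library as `(start_date - datetime.timedelta(days=n) for n in itertools.count())`, though the explicit generator function arguably reads more clearly.)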

Now we need a function we can apply to each group...

def count_consec_dates(dates, start_date):
    dates = pd.to_datetime(dates.values).date
    dates_set = set(dates)   # O(1) vs O(n) lookup times
    back_in_time = gen_prior_date(start_date)

    tally = 0
    while next(back_in_time) in dates_set:  # jump out on first miss
        tally += 1
    return tally

The rest is easy...

>>> small_data = {'item_number': ['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'],
...               'Comp_ID': ['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'],
...               'date': ['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']}
>>> small_df = pd.DataFrame(data=small_data)
>>> start_date = datetime.date(2016, 11, 22)
>>> groups = small_df.groupby(['Comp_ID', 'item_number']).date
>>> groups.apply(lambda x: count_consec_dates(x, start_date))
Comp_ID  item_number
1395     G123H          2
         KIN005         3
7787     A789B          2
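
As discussed in the comments below, the start date need not be hard-coded; any datetime.date can be passed as the anchor, e.g. the current day:

>>> start_date = datetime.date.today()
>>> groups.apply(lambda x: count_consec_dates(x, start_date))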
  • I was able to get the data set that you worked with to work, but the question was geared towards the larger set of data. The actual data-set I'm working with has thousands of sellers and items across a month of dates. – Yale Newman Nov 26 '16 at 01:03
  • Is space or runtime a problem? I did leave the numpy space with this answer but I was trying to avoid expensive operations like join or sort. – Bruce Pucci Nov 26 '16 at 01:30
  • Looks like it is running on the larger data set! Awesome! Honestly, it's still a bit over my head, so if you could throw in an explanation of what's going on step by step, that'd be really helpful. Also, if you could save me some time and make the start date a dynamic field, that'd be great. I'd imagine it's start_date = datetime.datetime.now(), however you change that to (year, month, day). – Yale Newman Nov 26 '16 at 01:32
  • I will write something up later. You can pass whatever you want into the count_consec_dates function as a start_date. – Bruce Pucci Nov 26 '16 at 01:35
  • I don't know where the correct forum is to explain what is going on, so I will do it here. I have a feeling where you are getting tripped up is the gen_prior_date function. Generator functions (like this) are an awesome feature of the language. It's definitely worth [reading](http://stackoverflow.com/questions/1756096/understanding-generators-in-python) about and trying to truly understand. Sorry if you already have an understanding of this. The key thing to remember is that generators are very efficient for this type of thing, as an entire Series or list doesn't have to be created. – Bruce Pucci Nov 26 '16 at 04:36
  • The other function, count_consec_dates is easy to understand. The first line is just doing a type conversion from str to datetime.date. The next line is turning the Series into a set because checks for inclusion are much faster on a set. The next line is setting up gen_prior_date. Now next() can be called on the return value, back_in_time, to get the next value from the generator. The last part is simply a counter. Essentially asking "how many dates does gen_prior_date have to generate before it yields a date not in the given iterable?" – Bruce Pucci Nov 26 '16 at 04:42
  • Again I would make sure that I understand what gen_prior_date is doing. The rest should be gravy based on what you have in your original question. – Bruce Pucci Nov 26 '16 at 04:44
  • Appreciate the additional insight here! Definitely picked up a few things that will be very helpful:) – Yale Newman Nov 26 '16 at 06:13
  • No problem. Can you accept the answer? I think that helps with the Google juice for the next person looking for this type of thing. – Bruce Pucci Nov 26 '16 at 14:45