
I've been working on this for some time now and can't seem to work it out. I'm working with a multi-indexed dataframe with 2 levels, which looks like the following:

import pandas as pd

def data():
    data = {'date': pd.Series(['2016-1-1', '2016-1-1', '2016-1-1', 
                               '2016-1-2', '2016-1-2', '2016-1-2', 
                               '2016-1-3', '2016-1-3', '2016-1-3',
                               '2016-1-4', '2016-1-4', '2016-1-4',
                               '2016-1-5', '2016-1-5', '2016-1-5']),
            'group': pd.Series(['groupA', 'groupB', 'groupC', 
                                'groupA', 'groupB', 'groupC', 
                                'groupA', 'groupB', 'groupC',
                                'groupA', 'groupB', 'groupC',
                                'groupA', 'groupB', 'groupC']),
            'series1': pd.Series([1, 2, 3, 
                                  1, 2, 3, 
                                  1, 2, 3,
                                  1, 3, 4,
                                  2, 3, 4]),
           'series2': pd.Series([1, 3, 4, 
                                 2, 3, 3, 
                                 2, 4, 2,
                                 1, 2, 3,
                                 1, 2, 3])}
    df = pd.DataFrame(data)
    df['date'] = pd.to_datetime(df['date'])
    df.set_index(['date', 'group'], inplace=True)
    return df

I have a column that specifies 1 of 3 conditions. There may be a more succinct way to write this portion of the code, but this is not my issue.

def add_cond(df):
    df['1minus2'] = df['series1'] - df['series2']   
    # first condition
    mask1 = df['series1'] < df['series2']
    df.loc[mask1, 'result'] = 'less'
    # second condition
    mask2 = df['series1'] > df['series2']
    df.loc[mask2, 'result'] = 'greater'
    # third condition
    mask3 = df['series1'] == df['series2']
    df.loc[mask3, 'result'] = 'equal'
    return df
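For what it's worth, the three masks can be collapsed into a single call. A sketch using np.select on toy series (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy series standing in for series1/series2; np.select checks the
# conditions in order and falls through to the default for ties.
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([2, 2, 1])
result = np.select([s1 < s2, s1 > s2], ['less', 'greater'], default='equal')
# result: ['less', 'equal', 'greater']
```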

My issue is that I want to add a column which counts the number of daily consecutive conditions. I've tried several different implementations of groupby and cumcount and I'm able to get the cumulative counts of all conditions but I want them to reset when the date index is not consecutive.
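To make the desired reset behaviour concrete: a common pattern for run-aware counting is to flag, within each group, where the condition label changes from the previous row, cumsum those flags into a block id, and cumcount within blocks. A sketch on a toy frame (the column names follow the example above, the values are made up):

```python
import pandas as pd

# Toy frame: 'result' labels per group, in date order.
df = pd.DataFrame({
    'group':  ['groupA'] * 4 + ['groupB'] * 2,
    'result': ['less', 'less', 'equal', 'equal', 'less', 'greater'],
})

# A new run starts wherever the label differs from the previous row
# of the same group; cumsum turns those run starts into a block id.
changed = df.groupby('group')['result'].transform(lambda s: s != s.shift())
block = changed.cumsum()
df['consec'] = df.groupby(block).cumcount() + 1
# df['consec']: [1, 2, 1, 2, 1, 1]
```

Since each group's first row always flags a new run, the block ids are unique across groups, so grouping by block alone is enough for the final cumcount.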

Below I've listed some related posts that I've tried. I thought the second answer in "Pandas: conditional rolling count" would work, but it uses a transform method which doesn't seem to work here since I have multiple columns.

Following the strategy outlined in the post "Finding consecutive segments in a pandas data frame", I created the code below, which creates a Series of numpy arrays containing the date and group index values and the data from the "result" column. I'm thinking I can slice this df in such a way that will allow me to count each consecutive group and merge those results back into the original df.

df1 = df.reset_index(level=['date','group']).groupby(['result']).apply(np.array)

The desired output column would look like this.

1
1
1
1
2
1
2
3
1
1
1
2
1
2
3

It's a bit hard to see whether the consecutive conditions are being met with the hierarchical df structure, but if I unstack the df it's much easier to see what I'm trying to achieve. Maybe there's a way to use unstack to orient the data in a way that gives me the result I'm looking for?

df['result'].groupby(['date','group']).head().unstack()

Out[9]: 
group        groupA   groupB   groupC
date                                 
2016-01-01    equal     less     less
2016-01-02     less     less    equal
2016-01-03     less     less  greater
2016-01-04    equal  greater  greater
2016-01-05  greater  greater  greater
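If the data is unstacked as above, the same reset-on-change count can be sketched column by column (a toy two-column frame for illustration; the labels mirror the table shape, not the real data):

```python
import pandas as pd

# Toy wide frame of condition labels, dates down the rows.
wide = pd.DataFrame({
    'groupA': ['equal', 'less', 'less', 'equal', 'greater'],
    'groupB': ['less', 'less', 'less', 'greater', 'greater'],
})

def run_count(col):
    # Block id increments each time the label changes from the row above.
    block = (col != col.shift()).cumsum()
    return col.groupby(block).cumcount() + 1

counts = wide.apply(run_count)
# counts['groupA']: [1, 1, 2, 1, 1]
# counts['groupB']: [1, 2, 3, 1, 2]
```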
Greg
  • I think you can be a lot more clear about the exact mechanics you're looking for. I'm not sure what the `date` column has to do with it (is it relevant)? Also, note that `np.sign(series1 - series2)` is a better way of doing `add_cond()` (if that's your real condition). – John Zwinck Jul 02 '17 at 01:53
  • Thanks @John Zwinck, I'll see how I can clarify. I had used np.sign in my original code but didn't like how zero evaluates as a sign change, and couldn't get it to work the way I wanted when using it within a lambda. – Greg Jul 02 '17 at 02:55

1 Answer


Let's use this approach with groupby and cumcount:

df = data()
df1 = add_cond(df)
df1['Consec'] = df1[['result']].groupby(['group','result']).cumcount() + 1

Output:

                   series1  series2  1minus2   result  Consec
date       group                                             
2016-01-01 groupA        1        1        0    equal       1
           groupB        2        3       -1     less       1
           groupC        3        4       -1     less       1
2016-01-02 groupA        1        2       -1     less       1
           groupB        2        3       -1     less       2
           groupC        3        3        0    equal       1
2016-01-03 groupA        1        2       -1     less       2
           groupB        2        4       -2     less       3
           groupC        3        2        1  greater       1
2016-01-04 groupA        1        1        0    equal       2
           groupB        3        2        1  greater       1
           groupC        4        3        1  greater       2
2016-01-05 groupA        2        1        1  greater       1
           groupB        3        2        1  greater       2
           groupC        4        3        1  greater       3
John Zwinck
Scott Boston
  • Why not just `df1['Consec'] = df1[['result']].groupby...`? No need for `assign` - it just makes the code more verbose and less efficient. – John Zwinck Jul 02 '17 at 03:47
  • Thanks @ScottBoston but when I run it as written I get a `KeyError: group` error. It runs if I rearrange it a bit and put ...`groupby(level='group')['result']`... instead but then the results are wrong. Any idea? – Greg Jul 02 '17 at 04:15
  • Ok, I'll upgrade my version. Just checked and I'm at 0.18.1. Thanks for your help! – Greg Jul 02 '17 at 04:31
  • @ScottBoston got my pandas updated this morning...had to sleep on this answer but I noticed it still doesn't operate as intended. For instance, at index location `2016-01-04 groupA` the result column is "equal" yet it's returning the cumulative of all `groupA equal` of 2 but should be 1. This is the part I've been struggling with. – Greg Jul 02 '17 at 17:46
  • True but I'm trying to get a count of the daily consecutive conditions. So since 2016-01-03 isn't "equal" for groupA then 2016-01-04 should start over at 1. – Greg Jul 02 '17 at 18:04