Identifying differences between groups in pandas dataframe

Question

I have a pandas dataframe indexed by date and and ID. I would like to:

Identify the ID of additions and deletions between dates
Add the ID to another dataframe with the date of the addition/deletion.

date        ID   value
12/31/2010  13  -0.124409
             9   0.555959
             1  -0.705634
             2  -3.123603
             4   0.725009
1/31/2011   13   0.471078
             9   0.276006
             1  -0.468463
            22   1.076821
            11   0.668599

Desired output:

date        ID  flag
1/31/2011   22  addition
1/31/2011   11  addition
1/31/2011   2   deletion
1/31/2011   4   deletion

I have tried Diff between two dataframes in pandas . I cannot get this to work on a grouped dataframe. I am unsure how to loop over each group, and compare to the previous group.

score 1 · Answer 1 · answered Mar 19 '18 at 18:15

1

You can using duplicated, to find the distinct value

s=df[~df.index.get_level_values(1).duplicated(keep=False)]
pd.DataFrame({'date':['1/31/2011']*len(s),'ID':s.index.get_level_values(1),'flag':(s.index.get_level_values(0)=='1/31/2011')}).replace({False:'deletion',True:'addition'})
Out[529]: 
   ID       date      flag
0   2  1/31/2011  deletion
1   4  1/31/2011  deletion
2  22  1/31/2011  addition
3  11  1/31/2011  addition

answered Mar 19 '18 at 18:15

BENY

317,841
20
164
234

Thank you. The code `s=df[~df.index.get_level_values(1).duplicated(keep=False)]` works. However how can I loop over several dates, ie 100 groups in the dataframe? – akr24 Mar 19 '18 at 18:36

piRSquared · Accepted Answer · 2018-03-19T23:27:32.883

0

I created a helper function that shifts the first level of a pandas.MultiIndex. With this, I can difference it with the original index to determine additions and deletions.

def shift_level(idx):
    level = idx.levels[0]
    mapping = dict(zip(level[:-1], level[1:]))
    idx = idx.set_levels(level.map(mapping.get), 0)
    return idx[idx.get_level_values(0).notna()].remove_unused_levels()

idx = df.index
fidx = shift_level(idx)

additions = fidx.difference(idx)
deletions = idx[idx.labels[0] > 0].difference(fidx)

pd.Series('+', additions).append(
    pd.Series('-', deletions)).rename('flag').reset_index()

        date  ID flag
0 2011-01-31   2    +
1 2011-01-31   4    +
2 2011-01-31  11    -
3 2011-01-31  22    -

edited Mar 19 '18 at 23:27

answered Mar 19 '18 at 19:26

piRSquared

285,575
57
475
624

Works! Only comment is that the additions in your solution get added to the prior month (12/31/2010), instead of the month they were added (1/31/2011). But I can look into modifying. – akr24 Mar 19 '18 at 20:38
This is great -- and also more complex than I anticipated. I'm new to pandas and surprised it doesn't have a native method to compare index levels/groups. – akr24 Mar 20 '18 at 16:18
There are other ways you can do this with unstacking and shifting. But if you end up with any sizable dataframe, this should remain stable from a performance stand point. – piRSquared Mar 20 '18 at 16:59

Identifying differences between groups in pandas dataframe

2 Answers2

Linked