
I have a transactional operation that produces a feed like the one below:


import pandas as pd

df = pd.DataFrame({'action': ['transacted', 'transacted', 'transacted', 'transacted', 'undo',
                              'transacted', 'transacted', 'transacted', 'transacted', 'transacted',
                              'undo', 'undo', 'undo', 'transacted'],
                   'transaction_count': [10, 20, 35, 60, 60, 60, 80, 90, 100, 10, 10, 100, 90, 90]})

        action  transaction_count
0   transacted                 10
1   transacted                 20
2   transacted                 35
3   transacted                 60
4         undo                 60
5   transacted                 60
6   transacted                 80
7   transacted                 90
8   transacted                100
9   transacted                 10
10        undo                 10
11        undo                100
12        undo                 90
13  transacted                 90

The counts follow a pattern, but not a linear one (10-20-35-60-80-90-100-10-20...).

An undo row states which transaction_count is cancelled.

There can be multiple undos in a row for multiple cancellations.

# This is an initial apply, to set it up
df['is_undone'] = df.apply(lambda x: 1 if x['action'] == 'undo' else 0, axis=1).shift(-1)
df = df.fillna(0)  # shift(-1) leaves a NaN on the last row

df = df.loc[df['is_undone'] == 0]
df = df.loc[df['action'] != 'undo']
df.reset_index(drop=True, inplace=True)

Unfortunately, it only works for a single undo, not for multiple in a row. apply does not allow accessing neighbouring row values, and I can't think of another solution. It also needs to process about 300k rows, so performance is a concern as well.
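
To make the cancellation rule concrete: each undo cancels the most recent not-yet-cancelled transaction, so consecutive undos unwind in reverse (LIFO) order. Here is a rough plain-Python sketch of that rule (illustration only, not my pandas attempt; surviving_rows is just a made-up name):

# Each 'undo' cancels the most recent open 'transacted' row (LIFO),
# so a stack of row indices models the feed.
def surviving_rows(actions):
    stack = []  # indices of not-yet-cancelled 'transacted' rows
    for i, action in enumerate(actions):
        if action == 'transacted':
            stack.append(i)
        else:  # 'undo'
            stack.pop()
    return stack

print(df.loc[surviving_rows(df['action'])].reset_index(drop=True))

This loop gives the expected output below, but I would prefer a vectorised pandas approach for the 300k-row feed.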

Expected result is:

       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90

Thanks in advance!

Caner Bas

3 Answers


A slight modification of this excellent answer might give you what you want:

Solution

def undo(frame):
    d = {"transacted": 0, "undo": 1}
    condition = (frame["action"].map(d)                                 # transacted -> 0, undo -> 1
                 .rolling(2, 2).apply(lambda x: x.to_list() == [0, 1])  # flag [transacted, undo] pairs
                 .mask(lambda x: x == 0)                                # 0 -> NaN
                 .bfill(limit=1)                                        # also mark the row before each undo
                 .fillna(0))
    return frame[condition == 0].reset_index(drop=True)

result = df.groupby("transaction_count").apply(undo).reset_index(drop=True)
>>> result
       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90

Explanation

groupby is being used to deal with each transaction_count separately. As an example, consider the case where transaction_count is 10.

frame = df[df["transaction_count"]==10]
>>> frame
        action  transaction_count
0   transacted                 10
9   transacted                 10
10        undo                 10

In the undo function, we first map the action column to a number:

>>> frame["action"].map(d)
0     0
9     0
10    1

Recognize that we want to remove rows where 0 (transacted) is immediately followed by 1 (undo). In the above, this corresponds to rows with indices 9 and 10.

To do this, we work with 2 rows of the frame at a time, using rolling with a window of 2 and a lambda:

>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1])
0     NaN
9     0.0
10    1.0

Now, mask 0s to np.nan, bfill (back-fill) exactly once, and fillna with 0.

>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
0     0.0
9     1.0
10    1.0
Name: action, dtype: float64

From the above, we need all rows that are not equal to 1. This is what is returned in the undo function.
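
As a quick check, running undo on the transaction_count == 10 group from earlier keeps only the first row, since the (transacted, undo) pair at indices 9 and 10 is removed:

>>> undo(df[df["transaction_count"] == 10])
       action  transaction_count
0  transacted                 10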

not_speshal
  • This works as well; however, my pandas knowledge could not handle what is going on in the code. I will study this one as well. Thanks! – Caner Bas Jun 29 '21 at 18:35
  • It is not the most "readable" code, but once you break it down and see what's happening at every `.` in the undo function, it will make perfect sense :) – not_speshal Jun 29 '21 at 18:35
  • CPU times: user 33 ms, sys: 0 ns, total: 33 ms; wall time: 30.4 ms. I believe the chained functions slowed it down a bit; the others are faster – Caner Bas Jun 29 '21 at 18:41
  • Added a line-by-line explanation which might help you walk through and understand what is happening. – not_speshal Jun 29 '21 at 18:49

If transaction_count is unique within a block, groups can be created with:

df['group'] = (df['action'].eq('transacted') &
               df['action'].shift().eq('undo')).cumsum()
        action  transaction_count  group
0   transacted                 10      0
1   transacted                 20      0
2   transacted                 35      0
3   transacted                 60      0
4         undo                 60      0
5   transacted                 60      1
6   transacted                 80      1
7   transacted                 90      1
8   transacted                100      1
9   transacted                 10      1
10        undo                 10      1
11        undo                100      1
12        undo                 90      1
13  transacted                 90      2

Then drop_duplicates can be used to remove duplicate transaction_count per group:

df = (df.drop_duplicates(['transaction_count', 'group'], keep=False)
      .drop('group', axis=1)
      .reset_index(drop=True))

df:

       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90
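
The detail doing the work here is keep=False, which drops every member of a duplicated set instead of keeping the first occurrence. A small illustration on a made-up frame:

>>> demo = pd.DataFrame({'v': [10, 60, 60, 80]})
>>> demo.drop_duplicates('v', keep=False)
    v
0  10
3  80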

If not, groups can be created using an inversed-pathing approach to build matching ids that associate transactions with their undos:

s = df['action'].shift()
m = df['action'].ne(s).cumsum()
df['group'] = (df['action'].eq('transacted') & s.eq('undo')).cumsum()
df['new'] = (
    df.groupby(['action', m]).cumcount()
        .mask(df['action'].eq('transacted'),
              df.loc[::-1].groupby(['action', m]).cumcount())
)
        action  transaction_count  group  new
0   transacted                 10      0    3
1   transacted                 20      0    2
2   transacted                 35      0    1
3   transacted                 60      0    0
4         undo                 60      0    0
5   transacted                 60      1    4
6   transacted                 80      1    3
7   transacted                 90      1    2  # Matches Undo 2
8   transacted                100      1    1  # Matches Undo 1
9   transacted                 10      1    0  # Matches Undo 0
10        undo                 10      1    0  # Undo 0
11        undo                100      1    1  # Undo 1
12        undo                 90      1    2  # Undo 2
13  transacted                 90      2    0
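
The reversed cumcount is the least obvious step: grouping df.loc[::-1] numbers each row by its position from the end of its run rather than the start, which is what lets every transacted row receive the same new value as the undo that cancels it. A small illustration on a made-up frame:

>>> demo = pd.DataFrame({'run': [0, 0, 0, 1, 1]})
>>> demo.groupby('run').cumcount().to_list()  # position from the start of each run
[0, 1, 2, 0, 1]
>>> demo.loc[::-1].groupby('run').cumcount().sort_index().to_list()  # position from the end
[2, 1, 0, 1, 0]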

Then duplicates can be dropped among group and new:

df = (df
      .drop_duplicates(['group', 'new'], keep=False)
      .drop(['group', 'new'], axis=1)
      .reset_index(drop=True))
       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90
Henry Ecker
  • The inversed pathing solution worked like a charm for this case. In the meantime, I also found a solution, which I will post in just a minute. I would love to hear your opinion! Thanks a lot – Caner Bas Jun 29 '21 at 18:18
  • CPU times: user 16.6 ms, sys: 0 ns, total: 16.6 ms Wall time: 14.8 ms -- for the second part of the solution on the example df – Caner Bas Jun 29 '21 at 18:41
  • Yeah, this would be expensive on a small dataset. On a dataset of 1000 rows this ran much faster than the other two options. – Henry Ecker Jun 29 '21 at 18:45
  • On 41k rows, this one: CPU times: user 44.4 ms, sys: 0 ns, total: 44.4 ms; wall time: 43.2 ms. The one I posted: CPU times: user 106 ms, sys: 5 µs, total: 106 ms; wall time: 105 ms. 50% off. Nice :) – Caner Bas Jun 29 '21 at 18:53
  • I thought that would be the case. Thank you for the confirmation. =) – Henry Ecker Jun 29 '21 at 18:55
df['is_undone'] = 0
# Group consecutive rows that share the same action
for k, v in df.groupby((df['action'].shift() != df['action']).cumsum()):
    if v['action'].max() == 'undo':  # this block is a run of undos
        # Mark the undo run and the same number of immediately preceding rows
        n = v['action'].count()
        df.loc[v.index[0] - n:v.index[-1], 'is_undone'] = 1
df = df.loc[df['is_undone'] == 0]
df.drop('is_undone', axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)

I found this article, which shows how consecutive blocks are grouped. The code runs in less than a second, which is more than enough for me.
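
For reference, the grouping key in the loop above is the consecutive-run trick: a new id starts whenever the action changes, so each block of identical actions is handled as one unit. On the original example feed:

>>> (df['action'].shift() != df['action']).cumsum().to_list()
[1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5]

Each run of undos (e.g. id 4, three rows) then marks itself and the same number of immediately preceding rows as undone.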

Caner Bas
  • Seems great! The ranging operation looks good. On small datasets this seems to run faster than mine does. I'm going to run some tests on larger sets. – Henry Ecker Jun 29 '21 at 18:38
  • CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms Wall time: 9.32 ms - with the example dataset – Caner Bas Jun 29 '21 at 18:42