
I have a transactional operation that produces a feed like the one below:


import pandas as pd

df = pd.DataFrame({'action': ['transacted', 'transacted', 'transacted', 'transacted', 'undo',
                              'transacted', 'transacted', 'transacted', 'transacted', 'transacted',
                              'undo', 'undo', 'undo', 'transacted'],
                   'transaction_count': [10, 20, 35, 60, 60, 60, 80, 90, 100, 10, 10, 100, 90, 90]})

        action  transaction_count
0   transacted                 10
1   transacted                 20
2   transacted                 35
3   transacted                 60
4         undo                 60
5   transacted                 60
6   transacted                 80
7   transacted                 90
8   transacted                100
9   transacted                 10
10        undo                 10
11        undo                100
12        undo                 90
13  transacted                 90

The counts follow a pattern, but not a linear one (10-20-35-60-80-90-100-10-20...).

An undo row states which transaction_count is cancelled.

There can be multiple undos in a row for multiple cancellations.

# This is an initial apply, to set it up
df['is_undone'] = df.apply(lambda x: 1 if x['action'] == 'undo' else 0, axis=1).shift(-1)
df = df.fillna(0)  # shift(-1) leaves a NaN on the last row

df = df.loc[df['is_undone'] == 0]
df = df.loc[df['action'] != 'undo']
df.reset_index(drop=True, inplace=True)

Unfortunately, it only works for a single undo, not for multiple in a row. apply does not allow accessing neighbouring row values, and I can't think of another solution. It also needs to process about 300k rows, so performance is a concern as well.
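
To make the cancellation rule concrete: each undo cancels the most recent not-yet-cancelled transaction, so consecutive undos unwind in reverse (LIFO) order. Here is a rough plain-Python sketch of that rule (illustration only, not my pandas attempt; surviving_rows is just a made-up name):

# Each 'undo' cancels the most recent open 'transacted' row (LIFO),
# so a stack of row indices models the feed.
def surviving_rows(actions):
    stack = []  # indices of not-yet-cancelled 'transacted' rows
    for i, action in enumerate(actions):
        if action == 'transacted':
            stack.append(i)
        else:  # 'undo'
            stack.pop()
    return stack

print(df.loc[surviving_rows(df['action'])].reset_index(drop=True))

This loop gives the expected output below, but I would prefer a vectorised pandas approach for the 300k-row feed.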

Expected result is:

       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90

Thanks in advance!

Caner Bas

3 Answers


A slight modification of this excellent answer might give you what you want:

Solution

def undo(frame):
    d = {"transacted": 0, "undo": 1}
    condition = (frame["action"].map(d)                                 # transacted -> 0, undo -> 1
                 .rolling(2, 2).apply(lambda x: x.to_list() == [0, 1])  # flag [transacted, undo] pairs
                 .mask(lambda x: x == 0)                                # 0 -> NaN
                 .bfill(limit=1)                                        # also mark the row before each undo
                 .fillna(0))
    return frame[condition == 0].reset_index(drop=True)

result = df.groupby("transaction_count").apply(undo).reset_index(drop=True)
>>> result
       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90

Explanation

groupby is being used to deal with each transaction_count separately. As an example, consider the case where transaction_count is 10.

frame = df[df["transaction_count"]==10]
>>> frame
        action  transaction_count
0   transacted                 10
9   transacted                 10
10        undo                 10

In the undo function, we first map the action column to a number:

>>> frame["action"].map(d)
0     0
9     0
10    1

Recognize that we want to remove rows where 0 (transacted) is immediately followed by 1 (undo). In the above, this corresponds to rows with indices 9 and 10.

To do this, we work with 2 rows of the frame at a time, using rolling with a window of 2 and a lambda:

>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1])
0     NaN
9     0.0
10    1.0

Now, mask 0s to np.nan, bfill (back-fill) exactly once, and fillna with 0.

>>> frame["action"].map(d).rolling(2,2).apply(lambda x: x.to_list()==[0,1]).mask(lambda x: x==0).bfill(limit=1).fillna(0)
0     0.0
9     1.0
10    1.0
Name: action, dtype: float64

From the above, we need all rows that are not equal to 1. This is what is returned in the undo function.
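
As a quick check, running undo on the transaction_count == 10 group from earlier keeps only the first row, since the (transacted, undo) pair at indices 9 and 10 is removed:

>>> undo(df[df["transaction_count"] == 10])
       action  transaction_count
0  transacted                 10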

not_speshal
  • This works as well; however, my pandas knowledge could not handle what is going on in the code. I will study this one as well. Thanks! – Caner Bas Jun 29 '21 at 18:35
  • It is not the most "readable" code, but once you break it down and see what's happening at every `.` in the undo function, it will make perfect sense :) – not_speshal Jun 29 '21 at 18:35
  • CPU times: user 33 ms, sys: 0 ns, total: 33 ms; wall time: 30.4 ms. I believe the chained functions slowed it down a bit; the others are faster – Caner Bas Jun 29 '21 at 18:41
  • Added a line-by-line explanation which might help you walk through and understand what is happening. – not_speshal Jun 29 '21 at 18:49

If transaction_count is unique within a block, groups can be created with:

df['group'] = (df['action'].eq('transacted') &
               df['action'].shift().eq('undo')).cumsum()
        action  transaction_count  group
0   transacted                 10      0
1   transacted                 20      0
2   transacted                 35      0
3   transacted                 60      0
4         undo                 60      0
5   transacted                 60      1
6   transacted                 80      1
7   transacted                 90      1
8   transacted                100      1
9   transacted                 10      1
10        undo                 10      1
11        undo                100      1
12        undo                 90      1
13  transacted                 90      2

Then drop_duplicates can be used to remove duplicate transaction_count per group:

df = (df.drop_duplicates(['transaction_count', 'group'], keep=False)
      .drop('group', axis=1)
      .reset_index(drop=True))

df:

       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90
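
The detail doing the work here is keep=False, which drops every member of a duplicated set instead of keeping the first occurrence. A small illustration on a made-up frame:

>>> demo = pd.DataFrame({'v': [10, 60, 60, 80]})
>>> demo.drop_duplicates('v', keep=False)
    v
0  10
3  80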

If not, groups can be created using an inversed-pathing approach to build matching ids that associate transactions with their undos:

s = df['action'].shift()
m = df['action'].ne(s).cumsum()
df['group'] = (df['action'].eq('transacted') & s.eq('undo')).cumsum()
df['new'] = (
    df.groupby(['action', m]).cumcount()
        .mask(df['action'].eq('transacted'),
              df.loc[::-1].groupby(['action', m]).cumcount())
)
        action  transaction_count  group  new
0   transacted                 10      0    3
1   transacted                 20      0    2
2   transacted                 35      0    1
3   transacted                 60      0    0
4         undo                 60      0    0
5   transacted                 60      1    4
6   transacted                 80      1    3
7   transacted                 90      1    2  # Matches Undo 2
8   transacted                100      1    1  # Matches Undo 1
9   transacted                 10      1    0  # Matches Undo 0
10        undo                 10      1    0  # Undo 0
11        undo                100      1    1  # Undo 1
12        undo                 90      1    2  # Undo 2
13  transacted                 90      2    0
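
The reversed cumcount is the least obvious step: grouping df.loc[::-1] numbers each row by its position from the end of its run rather than the start, which is what lets every transacted row receive the same new value as the undo that cancels it. A small illustration on a made-up frame:

>>> demo = pd.DataFrame({'run': [0, 0, 0, 1, 1]})
>>> demo.groupby('run').cumcount().to_list()  # position from the start of each run
[0, 1, 2, 0, 1]
>>> demo.loc[::-1].groupby('run').cumcount().sort_index().to_list()  # position from the end
[2, 1, 0, 1, 0]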

Then duplicates can be dropped among group and new:

df = (df
      .drop_duplicates(['group', 'new'], keep=False)
      .drop(['group', 'new'], axis=1)
      .reset_index(drop=True))
       action  transaction_count
0  transacted                 10
1  transacted                 20
2  transacted                 35
3  transacted                 60
4  transacted                 80
5  transacted                 90
Henry Ecker
  • The inversed pathing solution worked like a charm for this case. In the meantime, I also found a solution, which I will post in just a minute. I would love to hear your opinion! Thanks a lot – Caner Bas Jun 29 '21 at 18:18
  • CPU times: user 16.6 ms, sys: 0 ns, total: 16.6 ms Wall time: 14.8 ms -- for the second part of the solution on the example df – Caner Bas Jun 29 '21 at 18:41
  • Yeah, this would be expensive on a small dataset. On a dataset of 1000 rows this ran much faster than the other two options. – Henry Ecker Jun 29 '21 at 18:45
  • On 41k rows, this one: CPU times: user 44.4 ms, sys: 0 ns, total: 44.4 ms; wall time: 43.2 ms. The one I posted: CPU times: user 106 ms, sys: 5 µs, total: 106 ms; wall time: 105 ms. 50% off. Nice :) – Caner Bas Jun 29 '21 at 18:53
  • I thought that would be the case. Thank you for the confirmation. =) – Henry Ecker Jun 29 '21 at 18:55
df['is_undone'] = 0
# Group consecutive rows that share the same action
for k, v in df.groupby((df['action'].shift() != df['action']).cumsum()):
    if v['action'].max() == 'undo':  # this block is a run of undos
        # Mark the undo run and the same number of immediately preceding rows
        n = v['action'].count()
        df.loc[v.index[0] - n:v.index[-1], 'is_undone'] = 1
df = df.loc[df['is_undone'] == 0]
df.drop('is_undone', axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)

I found this article, which shows how consecutive blocks are grouped. The code runs in less than a second, which is more than enough for me.
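
For reference, the grouping key in the loop above is the consecutive-run trick: a new id starts whenever the action changes, so each block of identical actions is handled as one unit. On the original example feed:

>>> (df['action'].shift() != df['action']).cumsum().to_list()
[1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5]

Each run of undos (e.g. id 4, three rows) then marks itself and the same number of immediately preceding rows as undone.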

Caner Bas
  • Seems great! The ranging operation looks good. On small datasets this seems to run faster than mine does. I'm going to run some tests on larger sets. – Henry Ecker Jun 29 '21 at 18:38
  • CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms Wall time: 9.32 ms - with the example dataset – Caner Bas Jun 29 '21 at 18:42