One way is the "cumsum of changes, then groupby-cumsum" trick: flag the rows where the value changes, cumsum those flags to get an ID for each run of identical values, then count rows within each run. It is a bit harder here than in the usual write-ups of this trick, because you also need to do it per customer.
Here is one version, with lots of intermediate results rather than a one-liner. It may be possible to do it in one line, but that is more for showing off (even if I often do so), and here it is a bit trickier than just joining lines; a compact sketch is shown right after the step-by-step code anyway. Still, there are only 4 or 5 lines of actual code (the rest are comments).
import pandas as pd
# Just an example dataframe. You should have provided this [mre]. Plus,
# I add a second customer `y` to illustrate the "by customer" part.
# Note that the first `y` value is True, as is the preceding row, which
# belongs to `x`. And the second `y` value is True, while the previous
# `x` value is False.
# So it is interesting to check what occurs with those 2:
# we expect the first `y` to be detected as a value change (since it is
# the first value for that customer). And we expect the second `y` to be
# detected as a "same as before" case, even though it is a change from
# the row before, just not from the last row with the same customer.
df = pd.DataFrame({'customer': ['x','x','x','x','x','x','y','x','y','y','y'],
                   'has_transaction': [True,True,False,False,True,True,True,False,True,True,False]})
# A groupby object to perform computations per customer
gbcust = df.has_transaction.groupby(df.customer)
# A Series of True/False values saying whether a row's has_transaction
# value changed compared to the previous row of the same customer
change = gbcust.shift(1) != gbcust.shift(0)
# The trick is to `.cumsum` this "change" Series.
# Each time there is a change, the .cumsum result increases by 1.
# Each row that is the same as the previous one also keeps the same
# cumsum as the previous one.
# This way, we get a sort of "ID" for each packet of identical rows.
# But we must do this customer by customer
ids = change.groupby(df.customer).cumsum()
# The target column. We conveniently initialize it to 1, to be able to
# `cumsum` it
df['transaction_in_row'] = 1
# Now, all we have to do is accumulate the count of rows in each ids/customer group
df['transaction_in_row'] = df['transaction_in_row'].groupby([ids, df.customer]).cumsum()
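For the curious, since I said a one-liner may be possible: here is a compact sketch of the same logic as a single expression (same df as above, equivalent result; not faster, just denser):

df['transaction_in_row'] = (
    df.groupby(['customer',
                (df.has_transaction != df.groupby('customer').has_transaction.shift())
                    .groupby(df.customer).cumsum()])
      .cumcount() + 1
)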
Intermediate values for this example
The dataframe:
|    | customer | has_transaction |
|----|----------|-----------------|
| 0  | x        | True            |
| 1  | x        | True            |
| 2  | x        | False           |
| 3  | x        | False           |
| 4  | x        | True            |
| 5  | x        | True            |
| 6  | y        | True            |
| 7  | x        | False           |
| 8  | y        | True            |
| 9  | y        | True            |
| 10 | y        | False           |
Content of the change variable (presented here as if it were added to the dataframe, just for readability). change is True for each row that has a new has_transaction value for the same customer.
|    | customer | has_transaction | change |
|----|----------|-----------------|--------|
| 0  | x        | True            | True   |
| 1  | x        | True            | False  |
| 2  | x        | False           | True   |
| 3  | x        | False           | False  |
| 4  | x        | True            | True   |
| 5  | x        | True            | False  |
| 6  | y        | True            | True   |
| 7  | x        | False           | True   |
| 8  | y        | True            | False  |
| 9  | y        | True            | False  |
| 10 | y        | False           | True   |
Hence the ids, which are the cumsum of this: a different number for each group of identical has_transaction values (for a given customer).
|    | customer | has_transaction | ids |
|----|----------|-----------------|-----|
| 0  | x        | True            | 1   |
| 1  | x        | True            | 1   |
| 2  | x        | False           | 2   |
| 3  | x        | False           | 2   |
| 4  | x        | True            | 3   |
| 5  | x        | True            | 3   |
| 6  | y        | True            | 1   |
| 7  | x        | False           | 4   |
| 8  | y        | True            | 1   |
| 9  | y        | True            | 1   |
| 10 | y        | False           | 2   |
Note that these ids must be understood "by customer". So for customer x, we have groups 1, 2, 3, 4, and for customer y, we have groups 1 and 2.
And the final result, which is the count of rows in each id/customer group:
|    | customer | has_transaction | transaction_in_row |
|----|----------|-----------------|--------------------|
| 0  | x        | True            | 1                  |
| 1  | x        | True            | 2                  |
| 2  | x        | False           | 1                  |
| 3  | x        | False           | 2                  |
| 4  | x        | True            | 1                  |
| 5  | x        | True            | 2                  |
| 6  | y        | True            | 1                  |
| 7  | x        | False           | 1                  |
| 8  | y        | True            | 2                  |
| 9  | y        | True            | 3                  |
| 10 | y        | False           | 1                  |
I am pretty sure there are smarter, and even faster, pure pandas ways to do this (with fewer intermediate results).
But this one is linear, and it is impossible to do better than linear for this problem. And it is vectorized: no for loops, no apply of Python functions.
So, if there is a better solution, I bet it can't be much more than twice as fast (nothing compared to the factor of 100 or 1000 that we typically gain when we remove for loops and apply from pandas code).
The best way, that being said, would probably be to use a C extension, or numba, to create an ad-hoc numpy ufunc or pandas accumulator.
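To give an idea, here is a minimal numba sketch of such an accumulator. It is only an illustration under my own assumptions (the function name runs_per_customer and the integer-encoding step are mine, and it requires numba to be installed); it does the same per-customer run counting in one compiled linear pass:

import numba
import numpy as np

@numba.njit
def runs_per_customer(values, cust_codes, n_customers):
    # Single linear pass: extend the current run of identical values for the
    # row's customer, or restart the count at 1 when the value changes
    out = np.empty(len(values), dtype=np.int64)
    last_val = np.zeros(n_customers, dtype=np.int8)  # last value seen per customer
    seen = np.zeros(n_customers, dtype=np.bool_)     # customer already encountered?
    count = np.zeros(n_customers, dtype=np.int64)    # current run length per customer
    for i in range(len(values)):
        c = cust_codes[i]
        if seen[c] and last_val[c] == values[i]:
            count[c] += 1
        else:
            count[c] = 1
            last_val[c] = values[i]
            seen[c] = True
        out[i] = count[c]
    return out

# Usage: encode customers as integer codes first, then run the compiled loop
codes, uniques = pd.factorize(df['customer'])
df['transaction_in_row'] = runs_per_customer(
    df['has_transaction'].to_numpy().astype(np.int8), codes, len(uniques))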