
My goal is to generate the following row number, called `transaction_in_row`:

[expected output was shown as an image in the original question]

The row number should reset based on (partition by) the `customer` and `has_transaction` columns. My issue is with the yellow column: the `row_number` function in SQL returns 3 instead of 1.

This is my current SQL code:

```sql
row_number() over(partition by customer, has_transaction order by month asc) as transaction_in_row
```

Because I'm stuck in SQL, I'm trying to find a way to do this with a Python DataFrame instead. My first thought is to loop manually per customer and per month, but that would be painfully slow since I'm handling ~30 million rows.

Can anyone help me find a more efficient way to do this?

Blaze Tama
  • [Please don't post pictures of text](https://meta.stackoverflow.com/q/285551/4518341). Copy-paste the text itself and use the formatting tools like [code formatting](/editing-help#code). BTW, if it helps: [How to make good reproducible pandas examples](/q/20109391/4518341) – wjandrea Jun 19 '23 at 02:16

1 Answer


One way is to use the groupby + cumsum trick. It is harder here than in the usual examples you can find of it, because you also need to do it per customer.
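For context, this is what the single-customer version of the trick looks like (a minimal sketch on made-up data, without the per-customer part yet):

```python
import pandas as pd

s = pd.Series([True, True, False, False, True, True])

# A new "run" starts wherever the value differs from the previous row;
# cumsum turns those change flags into run ids: 1, 1, 2, 2, 3, 3
ids = (s != s.shift()).cumsum()

# Position of each row inside its run
pos = s.groupby(ids).cumcount() + 1
print(pos.tolist())  # -> [1, 2, 1, 2, 1, 2]
```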

Here is one version, with lots of intermediate results rather than a one-liner. It may be possible to do it in one line, but that is more for showing off (even if I often do so), and here it is a bit trickier than just joining lines. Still, there are only 4 or 5 lines of actual code (the rest are comments).

```python
import pandas as pd

# Just an example dataframe. You should have provided this [mre]. Plus,
# I add a second customer `y` to illustrate the "by customer" part.
# Note that the first `y` value is True, as is the preceding line, which
# belongs to `x`. And the second `y` value is True, while the
# previous `x` value is False.
# So it is interesting to check what occurs with those two:
# we expect the first `y` to be detected as a value change (since it is
# the first value), and we expect the second `y` to be detected as a
# "same as before" case, even though it is a change from the row before
# (but not from the last row with the same customer).
df = pd.DataFrame({
    'customer': ['x', 'x', 'x', 'x', 'x', 'x', 'y', 'x', 'y', 'y', 'y'],
    'has_transaction': [True, True, False, False, True, True,
                        True, False, True, True, False],
})

# A groupby object to perform computations per customer
gbcust = df.has_transaction.groupby(df.customer)

# A table of True/False values saying whether a row has changed its
# has_transaction value compared to the previous row of the same customer
change = gbcust.shift(1) != gbcust.shift(0)

# The trick is to `.cumsum` the values in this "change" column.
# Each time we have a change, the .cumsum result increases by 1.
# Each row that is the same as the previous one also keeps the same
# cumsum as the previous one.
# This way, we get a sort of "ID" for packets of identical rows.
# But we must do this customer by customer
ids = change.groupby(df.customer).cumsum()

# The target column. We conveniently initialize it to 1, to be able to
# `cumsum` it
df['transaction_in_row'] = 1

# Now, all we have to do is accumulate the count of rows in each
# group of ids/customer
df['transaction_in_row'] = df['transaction_in_row'].groupby([ids, df.customer]).cumsum()
```

Intermediate values for this example:

The dataframe:

```
   customer has_transaction
0         x            True
1         x            True
2         x           False
3         x           False
4         x            True
5         x            True
6         y            True
7         x           False
8         y            True
9         y            True
10        y           False
```

Content of the `change` variable (presented here as if it were added to the dataframe, just for readability). `change` is True for each row that has a new `has_transaction` value for the same customer.

```
   customer has_transaction change
0         x            True   True
1         x            True  False
2         x           False   True
3         x           False  False
4         x            True   True
5         x            True  False
6         y            True   True
7         x           False   True
8         y            True  False
9         y            True  False
10        y           False   True
```

Hence the `ids`, which are the cumsum of this: a different number for each group of identical `has_transaction` values (for a given customer).

```
   customer has_transaction ids
0         x            True   1
1         x            True   1
2         x           False   2
3         x           False   2
4         x            True   3
5         x            True   3
6         y            True   1
7         x           False   4
8         y            True   1
9         y            True   1
10        y           False   2
```

Note that these ids must be understood "by customer". So for customer `x`, we have groups 1, 2, 3, 4, and for customer `y`, we have groups 1 and 2.

And the final result, which is the count of rows in each id/customer group:

```
   customer has_transaction transaction_in_row
0         x            True                  1
1         x            True                  2
2         x           False                  1
3         x           False                  2
4         x            True                  1
5         x            True                  2
6         y            True                  1
7         x           False                  1
8         y            True                  2
9         y            True                  3
10        y           False                  1
```
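For convenience, here is the whole approach collected into one self-contained script (same logic as above, just gathered for copy-paste):

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['x', 'x', 'x', 'x', 'x', 'x', 'y', 'x', 'y', 'y', 'y'],
    'has_transaction': [True, True, False, False, True, True,
                        True, False, True, True, False],
})

gbcust = df.has_transaction.groupby(df.customer)
change = gbcust.shift(1) != gbcust.shift(0)
ids = change.groupby(df.customer).cumsum()
df['transaction_in_row'] = 1
df['transaction_in_row'] = (
    df['transaction_in_row'].groupby([ids, df.customer]).cumsum()
)
print(df['transaction_in_row'].tolist())
# -> [1, 2, 1, 2, 1, 2, 1, 1, 2, 3, 1]
```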

I am pretty sure there are smarter, and even faster, pure-pandas ways to do this (with fewer intermediate results). But this is linear, and it is impossible to do better than linear for this problem. And it is vectorized: no for loops, no apply of Python functions.
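One such variant (a sketch, not benchmarked) skips the set-to-1-then-cumsum step by using `groupby(...).cumcount()` instead:

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['x', 'x', 'x', 'x', 'x', 'x', 'y', 'x', 'y', 'y', 'y'],
    'has_transaction': [True, True, False, False, True, True,
                        True, False, True, True, False],
})

# Same "change" flags as before, then cumcount directly gives the
# 0-based position of each row inside its (customer, run-id) group
change = df.groupby('customer')['has_transaction'].shift() != df['has_transaction']
ids = change.groupby(df['customer']).cumsum()
df['transaction_in_row'] = df.groupby([df['customer'], ids]).cumcount() + 1
print(df['transaction_in_row'].tolist())
# -> [1, 2, 1, 2, 1, 2, 1, 1, 2, 3, 1]
```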

So, if there is a better solution, I bet it can't be much more than twice as fast (nothing compared to the factor-of-100 or -1000 speedups we typically get when removing for loops and apply from pandas code).

The best way, that being said, would probably be to use a C extension, or numba, to create an ad-hoc numpy ufunc or pandas accumulator.
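To illustrate that idea, here is a plain-Python prototype of such an accumulator: a single linear pass that tracks, per customer, the last value seen and the current run length. (The function name is just for illustration; to `@njit`-compile it with numba you would need its typed containers, or integer-encoded customer ids.) Rows are assumed sorted by month within each customer:

```python
import numpy as np

def transaction_in_row(customers, flags):
    """One linear pass over the rows, in their existing order."""
    last_flag = {}  # last has_transaction value seen per customer
    run_len = {}    # current run length per customer
    out = np.empty(len(flags), dtype=np.int64)
    for i, (c, f) in enumerate(zip(customers, flags)):
        if c in last_flag and last_flag[c] == f:
            run_len[c] += 1  # same value as last row of this customer
        else:
            run_len[c] = 1   # value changed (or first row): reset the run
        last_flag[c] = f
        out[i] = run_len[c]
    return out

customers = ['x', 'x', 'x', 'x', 'x', 'x', 'y', 'x', 'y', 'y', 'y']
flags = [True, True, False, False, True, True, True, False, True, True, False]
print(transaction_in_row(customers, flags).tolist())
# -> [1, 2, 1, 2, 1, 2, 1, 1, 2, 3, 1]
```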

chrslg