3

I have a dataframe (call it txn_df) that contains monetary transaction records, here are the significant columns in this problem:

txn_year    txn_month   custid  withdraw    deposit
2011        4           123     0.0         100.0
2011        5           123     0.0         0.0
2011        6           123     0.0         0.0
2011        7           123     50.1        0.0
2011        8           123     0.0         0.0

Assume also that we have multiple customers here. withdraw and deposit 0.0 value for both means no transaction has taken place. What I want to do is to produce a new column that indicates how many months has occurred since there was a transaction. Something similar to this:

txn_year    txn_month   custid  withdraw    deposit     num_months_since_last_txn
2011        4           123     0.0         100.0       0
2011        5           123     0.0         0.0         1
2011        6           123     0.0         0.0         2           
2011        7           123     50.1        0.0         3
2011        8           123     0.0         0.0         1

The only solution so far that I can think of is to produce a new column has_txn (which is either 1/0 or True/False) when either one of withdraw and deposit has value > 0.0 but I can't continue from there.

oikonomiyaki
  • 7,691
  • 15
  • 62
  • 101

1 Answers1

5

one way to solve this problem,

df['series'] =  df[['withdraw','deposit']].ne(0).sum(axis=1)
m = df['series']>=1

As @Chris A commented,

m = df[['withdraw','deposit']].gt(0).any(axis=1) #replacement for above snippet,

df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1).fillna(0)
print df

Output:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          6     123       0.0      0.0
3      2011          7     123      50.1      0.0
4      2011          8     123       0.0      0.0
   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
2      2011          6     123       0.0      0.0                        2.0
3      2011          7     123      50.1      0.0                        3.0
4      2011          8     123       0.0      0.0                        1.0

Explanation:

  1. To get transaction happened or not use ne and sum to get values in binary.
  2. when transaction is 1 create the series from 0,1,2...n using groupby, cumsum, cumcount.
  3. rearrange the value for 0 using .loc

Note: May be I have added more complex to solving this problem. But It will give you an idea and approach to solve this problem.

Solution for considering customer Id,

df=df.sort_values(by=['custid','txn_month'])
mask=~df.duplicated(subset=['custid'],keep='first')
m = df[['withdraw','deposit']].gt(0).any(axis=1)
df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1)
df.loc[mask,'num_months_since_last_txn']=0

Sample Input:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          4    1245       0.0    100.0
3      2011          5    1245       0.0      0.0
4      2011          6     123       0.0      0.0
5      2011          7    1245      50.1      0.0
6      2011          7     123      50.1      0.0
7      2011          8     123       0.0      0.0
8      2011          6    1245       0.0      0.0
9      2011          8    1245       0.0      0.0

Sample Output:

   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
4      2011          6     123       0.0      0.0                        2.0
6      2011          7     123      50.1      0.0                        3.0
7      2011          8     123       0.0      0.0                        1.0
2      2011          4    1245       0.0    100.0                        0.0
3      2011          5    1245       0.0      0.0                        1.0
8      2011          6    1245       0.0      0.0                        2.0
5      2011          7    1245      50.1      0.0                        3.0
9      2011          8    1245       0.0      0.0                        1.0

Explanation for considering Customer ID,

  1. The above code works based on the interval between [1,1]. So to make the same format, sort df by cust_id and txn_month, For future you could add txn_year.
  2. fillna(0), won't be work here, Because shift will not create NaN for next customer. To reset into 0 Find duplication of customer Id and take first value replace that into 0.
Mohamed Thasin ah
  • 10,754
  • 11
  • 52
  • 111
  • 1
    great answer. Suppose the case where there was both 'deposit' AND 'withdrawal' in a month. `df['series']` would be `2` and therefore not picked up by the logic of your boolean mask `m`. You could potentially condense lines 1 and 2 down with: `m = df[['withdraw','deposit']].gt(0).any(axis=1)` – Chris Adams Sep 25 '18 at 10:50
  • 1
    @chris A - you are right. Thanks for the catch. Updated to the solution – Mohamed Thasin ah Sep 25 '18 at 10:52
  • Just curious, does this solution include the grouping by `custid`? – oikonomiyaki Sep 25 '18 at 11:34
  • @menorah84 - No it won't, It's based on position starting from 1 to ending to 1. But you can apply this same after sorting your dataframe by cust id. then you have to replace fillna command with starting of custid – Mohamed Thasin ah Sep 25 '18 at 11:55
  • sorry, to be posting something so late on this post, but is there a way to do this in PySpark? – Manas Jani Jun 14 '19 at 18:13