I have a dataframe holding all transaction data for my customers; the columns are mailid, txn_date, city. Now I have to consider customers active from 01-Jan-2016 onwards and, for each mailid among them, fetch their transactions from the base dataframe, keep only their last 12 months of data (txn_date between their last txn_date and that date minus a 365-day timedelta), and then find the city in which they transacted the most.
sample base dataframe
#df
mailid txn_date city
satya 2015-07-21 a
satya 2015-08-11 b
satya 2016-05-11 c
xyz 2016-06-01 f
satya 2016-06-01 a
satya 2016-06-01 b
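For reproducibility, the sample above can be built like this (assuming txn_date is parsed as a datetime, which my code below relies on):

import pandas as pd

df = pd.DataFrame({
    'mailid': ['satya', 'satya', 'satya', 'xyz', 'satya', 'satya'],
    'txn_date': pd.to_datetime(['2015-07-21', '2015-08-11', '2016-05-11',
                                '2016-06-01', '2016-06-01', '2016-06-01']),
    'city': ['a', 'b', 'c', 'f', 'a', 'b'],
})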
As I need customers from 2016-01-01 onwards, I did:
d = df[['mailid', 'txn_date']][df['txn_date'] >= '2016-01-01']
Now, for each mailid in d, I have to fetch their last 12 months of transaction data from the base dataframe df and work out the city they transacted in most. For that I am using a for loop like:
from datetime import timedelta
import pandas as pd

x = d.groupby(['mailid'])['txn_date'].max().reset_index()  # last transacted date per mailid, used to derive the 12-month-back date
x['max_city'] = 'N'  # default value 'N'
for idx, row in x.iterrows():
    g = row['txn_date'].date()
    h = g - timedelta(days=365)  # start of the last-12-month window
    y = df[(df['mailid'] == row['mailid']) & (df['txn_date'] >= str(h))]
    y = y.sort_values(['txn_date'], ascending=True)  # sort so the most recent txn wins when two cities have the same count
    c = y.groupby(['mailid', 'city']).size().reset_index()
    v = c.groupby(['mailid'])[0].max().reset_index()
    dca = pd.merge(y, c, on=['mailid', 'city'], how='left')
    dcb = pd.merge(dca, v, on=['mailid', 0])
    m = dcb.drop_duplicates(['mailid'], keep='last')
    x.loc[idx, 'max_city'] = m['city'].unique()[0]  # write back to x; assigning to row would be lost
Output:
mailid max_city
satya b ### in the last 12 months (roughly 2015-06-01 to 2016-06-01) the txn counts are a=2, b=2; the last txn among the tied cities is in b, so b is taken as the max city
xyz f
My code works for a small chunk of data (I am sure it is unorganised and has no proper naming convention, as I was practicing), but the loop hits the main base dataframe df once for every customer present in dataframe x.
So my main concern is: if df has 100 million rows and x has 6 million rows, the for loop will execute 6 million times, each time hitting df to fetch the matching mailid's data and doing the work to find the max transacted city.
If it calculates the max city for 3 mailids per minute, then 6 million mailids will take about 2 million minutes (nearly 4 years)... which will be a serious problem.
So I need suggestions from you guys on how to optimize this scenario: hitting the main base dataframe fewer times and using a more convenient pandas way to do it (which I have not been able to work out yet).
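For illustration, the merge-based sketch below is the kind of thing I am hoping for, but I am not sure it is right or the best approach (untested at scale; it assumes txn_date is already a datetime column and tries to keep the same tie-break: on equal counts the city of the most recent txn wins):

import pandas as pd

# last txn date per customer, restricted to customers active on/after 2016-01-01
last = (df[df['txn_date'] >= '2016-01-01']
        .groupby('mailid')['txn_date'].max()
        .rename('last_txn').reset_index())

# one merge attaches the window start to every row of those customers,
# instead of hitting df once per mailid
t = df.merge(last, on='mailid', how='inner')
t = t[t['txn_date'] >= t['last_txn'] - pd.Timedelta(days=365)]

# per (mailid, city): number of txns and most recent txn date
agg = t.groupby(['mailid', 'city'])['txn_date'].agg(['size', 'max']).reset_index()

# top city per mailid; on a tied count the most recent txn date wins
agg = agg.sort_values(['mailid', 'size', 'max'])
result = (agg.drop_duplicates('mailid', keep='last')[['mailid', 'city']]
             .rename(columns={'city': 'max_city'}))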
Please suggest! Thanks in advance.