I have a dataframe holding all transaction data for my customers; the columns are mailid, txn_date, city. Now I have to consider customers active from 01-Jan-2016 onwards and, for each mailid among them, fetch their transactions from the base dataframe, keep only their last 12 months of data (txn_date between their last txn_date and that date minus a 365-day timedelta), and then find the city in which they transacted the most.
sample base dataframe
#df
mailid txn_date city
satya 2015-07-21 a
satya 2015-08-11 b
satya 2016-05-11 c
xyz 2016-06-01 f
satya 2016-06-01 a
satya 2016-06-01 b
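For reproducibility, the sample above can be built like this (assuming txn_date is parsed as a datetime, which my code below relies on):

import pandas as pd

df = pd.DataFrame({
    'mailid': ['satya', 'satya', 'satya', 'xyz', 'satya', 'satya'],
    'txn_date': pd.to_datetime(['2015-07-21', '2015-08-11', '2016-05-11',
                                '2016-06-01', '2016-06-01', '2016-06-01']),
    'city': ['a', 'b', 'c', 'f', 'a', 'b'],
})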
As I need customers from 2016-01-01 onwards, I did:
d = df[['mailid', 'txn_date']][df['txn_date'] >= '2016-01-01']
Now, for each mailid in d, I have to fetch their last 12 months of transaction data from the base dataframe df and work out the city they transacted in most. For that I am using a for loop like:
from datetime import timedelta
import pandas as pd

x = d.groupby(['mailid'])['txn_date'].max().reset_index()  # last transacted date per mailid, used to derive the 12-month-back date
x['max_city'] = 'N'  # default value 'N'
for idx, row in x.iterrows():
    g = row['txn_date'].date()
    h = g - timedelta(days=365)  # start of the last-12-month window
    y = df[(df['mailid'] == row['mailid']) & (df['txn_date'] >= str(h))]
    y = y.sort_values(['txn_date'], ascending=True)  # sort so the most recent txn wins when two cities have the same count
    c = y.groupby(['mailid', 'city']).size().reset_index()
    v = c.groupby(['mailid'])[0].max().reset_index()
    dca = pd.merge(y, c, on=['mailid', 'city'], how='left')
    dcb = pd.merge(dca, v, on=['mailid', 0])
    m = dcb.drop_duplicates(['mailid'], keep='last')
    x.loc[idx, 'max_city'] = m['city'].unique()[0]  # write back to x; assigning to row would be lost
Output:
mailid max_city
satya b ### in the last 12 months (roughly 2015-06-01 to 2016-06-01) the txn counts are a=2, b=2; the last txn among the tied cities is in b, so b is taken as the max city
xyz f
My code works for a small chunk of data (I am sure it is unorganised and has no proper naming convention, as I was practicing), but the loop hits the main base dataframe df once for every customer present in dataframe x.
So my main concern is: if df has 100 million rows and x has 6 million rows, the for loop will execute 6 million times, each time hitting df to fetch the matching mailid's data and doing the work to find the max transacted city.
If it calculates the max city for 3 mailids per minute, then 6 million mailids will take about 2 million minutes (nearly 4 years)... which will be a serious problem.
So I need suggestions from you guys on how to optimize this scenario: hitting the main base dataframe fewer times and using a more convenient pandas way to do it (which I have not been able to work out yet).
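For illustration, the merge-based sketch below is the kind of thing I am hoping for, but I am not sure it is right or the best approach (untested at scale; it assumes txn_date is already a datetime column and tries to keep the same tie-break: on equal counts the city of the most recent txn wins):

import pandas as pd

# last txn date per customer, restricted to customers active on/after 2016-01-01
last = (df[df['txn_date'] >= '2016-01-01']
        .groupby('mailid')['txn_date'].max()
        .rename('last_txn').reset_index())

# one merge attaches the window start to every row of those customers,
# instead of hitting df once per mailid
t = df.merge(last, on='mailid', how='inner')
t = t[t['txn_date'] >= t['last_txn'] - pd.Timedelta(days=365)]

# per (mailid, city): number of txns and most recent txn date
agg = t.groupby(['mailid', 'city'])['txn_date'].agg(['size', 'max']).reset_index()

# top city per mailid; on a tied count the most recent txn date wins
agg = agg.sort_values(['mailid', 'size', 'max'])
result = (agg.drop_duplicates('mailid', keep='last')[['mailid', 'city']]
             .rename(columns={'city': 'max_city'}))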
Please suggest! Thanks in advance.