I have the following issue: I have a dataframe with 3 columns : The first is userID, the second is invoiceType and the third the time of creation of the invoice.
df = pd.read_csv('invoice.csv')
Output: UserID InvoiceType CreateTime
1 a 2018-01-01 12:31:00
2 b 2018-01-01 12:34:12
3 a 2018-01-01 12:40:13
1 c 2018-01-09 14:12:25
2 a 2018-01-12 14:12:29
1 b 2018-02-08 11:15:00
2 c 2018-02-12 10:12:12
I am trying to plot the invoice cycle for each user. I need to create2 new columns, time_diff
, and time_diff_wrt_first_invoice
. time_diff
will represent the time difference between each invoice for each user and time_diff_wrt_first_invoice
will represent the time difference between all the invoices and the first invoice, which will be interesting for ploting purposes. This is my code:
"""
********** Exploding a variable that is a list in each dataframe cell
"""
def explode_list(df,x):
return (df[x].apply(pd.Series)
.stack()
.reset_index(level = 1, drop=True)
.to_frame(x))
"""
****** applying explode_list to all the columns ******
"""
def explode_listDF(df):
exploaded_df = pd.DataFrame()
for x in df.columns.tolist():
exploaded_df = pd.concat([exploaded_df, explode_list(df,x)],
axis = 1)
return exploaded_df
"""
******** Getting the time difference column in pivot table format
"""
def pivoted_diffTime(df1, _freq=60):
# _ freq is 1 for minutes frequency
# _freq is 60 for hour frequency
# _ freq is 60*24 for daily frequency
# _freq is 60*24*30 for monthly frequency
df = df.sort_values(['UserID', 'CreateTime'])
df_pivot = df.pivot_table(index = 'UserID',
aggfunc= lambda x : list(v for v in x)
)
df_pivot['time_diff'] = [[0]]*len(df_pivot)
for user in df_pivot.index:
try:
_list = [0]+[math.floor((x - y).total_seconds()/(60*_freq))
for x,y in zip(df_pivot.loc[user, 'CreateTime'][1:],
df_pivot.loc[user, 'CreateTime'][:-1])]
df_pivot.loc[user, 'time_diff'] = _list
except:
print('There is a prob here :', user)
return df_pivot
"""
***** Pipelining the two functions to obtain an exploaded dataframe
with time difference ******
"""
def get_timeDiff(df, _frequency):
df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))
return df
And once I have time_diff, I am creating time_diff_wrt_first_variable this way:
# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] =
[[0]]*len(df_with_timeDiff)
# Then we loop over users and we apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
df_with_timeDiff.loc[df_with_timeDiff.UserID==user,'time_diff_wrt_first_i nvoice'] = np.cumsum(df_with_timeDiff.loc[df_with_timeDiff.UserID==user,'time_diff'])
The problem is that I have a dataframe with hundreds of thousands of users and it's so time consuming. I am wondering if there is a solution that fits better my need.