0

I have the following issue: I have a dataframe with 3 columns : The first is userID, the second is invoiceType and the third the time of creation of the invoice.

df = pd.read_csv('invoice.csv')
Output: UserID  InvoiceType   CreateTime
         1         a          2018-01-01 12:31:00
         2         b          2018-01-01 12:34:12
         3         a          2018-01-01 12:40:13
         1         c          2018-01-09 14:12:25
         2         a          2018-01-12 14:12:29
         1         b          2018-02-08 11:15:00
         2         c          2018-02-12 10:12:12

I am trying to plot the invoice cycle for each user. I need to create2 new columns, time_diff, and time_diff_wrt_first_invoice. time_diff will represent the time difference between each invoice for each user and time_diff_wrt_first_invoice will represent the time difference between all the invoices and the first invoice, which will be interesting for ploting purposes. This is my code:

"""
********** Exploding a variable that is a list in each dataframe cell 

"""
def explode_list(df,x):
  return (df[x].apply(pd.Series)
  .stack()
  .reset_index(level = 1, drop=True)
  .to_frame(x))

"""
  ****** applying explode_list to all the columns ******
"""

def explode_listDF(df):
    exploaded_df = pd.DataFrame()

    for x in df.columns.tolist():
        exploaded_df = pd.concat([exploaded_df, explode_list(df,x)], 
        axis = 1)

    return exploaded_df


 """
   ******** Getting the time difference column in pivot table format
 """
def pivoted_diffTime(df1, _freq=60):

    # _ freq is 1 for minutes frequency
    # _freq is 60 for hour frequency
    # _ freq is 60*24 for daily frequency
    # _freq is 60*24*30 for monthly frequency

    df = df.sort_values(['UserID', 'CreateTime'])

    df_pivot = df.pivot_table(index = 'UserID', 
                         aggfunc= lambda x : list(v for v in x)
                         )

    df_pivot['time_diff'] = [[0]]*len(df_pivot)

    for user in df_pivot.index:

        try:    
           _list = [0]+[math.floor((x - y).total_seconds()/(60*_freq)) 
           for x,y in zip(df_pivot.loc[user, 'CreateTime'][1:], 
           df_pivot.loc[user, 'CreateTime'][:-1])]

           df_pivot.loc[user, 'time_diff'] = _list


        except:
            print('There is a prob here :', user)

    return df_pivot


"""
***** Pipelining the two functions to obtain an exploaded dataframe 
 with time difference ******
"""
def get_timeDiff(df, _frequency):

    df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))

    return df

And once I have time_diff, I am creating time_diff_wrt_first_variable this way:

# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = 
[[0]]*len(df_with_timeDiff)

# Then we loop over users and we apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():

 df_with_timeDiff.loc[df_with_timeDiff.UserID==user,'time_diff_wrt_first_i nvoice'] = np.cumsum(df_with_timeDiff.loc[df_with_timeDiff.UserID==user,'time_diff'])

The problem is that I have a dataframe with hundreds of thousands of users and it's so time consuming. I am wondering if there is a solution that fits better my need.

skrx
  • 19,980
  • 5
  • 34
  • 48
Alaa J E
  • 31
  • 5
  • have you used cumsum() instead. see https://stackoverflow.com/a/39623235/461887 – sayth Jul 30 '18 at 11:07
  • I have used cumsum for the 2nd question, but I think It may fit better with an aggregation method rather than a for loop as I have do it. Thank you for your answer. But for the first question, in order to create the column `time_diff` I am creating a new variable, where, for each user, the first value will be 0, the second value is t2-t1, the 3rd is t3-t2, ... – Alaa J E Jul 30 '18 at 11:58

2 Answers2

0

Check out .loc[] for pandas.

    df_1 = pd.DataFrame(some_stuff)

    df_2 = df_1.loc[tickers['column'] >= some-condition, 'specific-column']        

you can access specific columns, run a loop to check for certain types of conditions, and if you add a comma after the condition and put in a specific column name it'll only return that column. I'm not 100% sure if that answers whatever question you're asking, cause I didn't actually see one, but it seemed like you were running a lot of for loops and stuff to isolate columns, which is what .loc[] is for.

Puhtooie
  • 126
  • 7
  • Thank you for your answer. Ia m aware of .loc, the problem is that I need to do everything separately per each user, this is why I have all these for loops. I think that the `pivot_table` is consuming lot of time. – Alaa J E Jul 30 '18 at 12:00
  • Are you trying to see how frequently the users come on? Or? – Puhtooie Jul 30 '18 at 15:05
  • Not really how frequently they come on. More precisely, I have an invoice system, where a user gets an invoice, either when it's a renewal, a modification, a cancellation, a subscription, .... each invoice has its createTime. My main goal is to create a new column that I call time_diff that will detect, for each user, the time difference between each invoice. – Alaa J E Jul 31 '18 at 13:12
0

I have found a better solution. Here's my code :

def next_diff(x):
   return ([0]+[(b-a).total_seconds()/3600 for b,a in zip(x[1:], x[:-1])])


def create_timediff(df):

   df.sort_values(['UserID', 'CreateTime'], inplace=True)
   a = df.groupby('UserID').agg({'CreateTime' :lambda x : list(v for v in x)}).CreateTime.apply(next_diff)
   b = a.apply(np.cumsum)

   a = a.reset_index()
   b = b.reset_index()

   # Here I explode the lists inside the cell
   rows1= []
   _ = a.apply(lambda row: [rows1.append([row['UserID'], nn]) 
                     for nn in row.CreateTime], axis=1)
   rows2 = []
   __ = b.apply(lambda row: [rows2.append([row['UserID'], nn]) 
                     for nn in row.CreateTime], axis=1)

   df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
   df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])

   df = df.set_index('UserID')
   df['time_diff']= df1_new['CreateTime']
   df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
   df.reset_index(inplace=True)

   return df
Alaa J E
  • 31
  • 5