1

I use the following python command, and get a result I do not know exactly how to convert to hours or any other time standard. My goal is to understand mean, min and max values of the behavior of two student groups logging in, and counting all logons per student in each group until a cut date.

Data (col2) - result of subtracting different logon times per student from date when student ends enrolment with university (numbers with a minus in col2 indicate that there are no logons after the cut date):

        User Name      status  Col1  ...       Col2               Check Check all
4053       191446  terminated  Diploming  ...  -62 days +00:31:00     0         0
4054       191446  terminated  Diploming  ...  -62 days +00:23:00     0         0
4055       191446  terminated  Diploming  ...  -62 days +00:20:00     0         0
55838      190690  terminated  Diploming  ... -142 days +21:17:00     0         0
55839      190690  terminated  Diploming  ... -142 days +15:10:00     0         0
...           ...         ...        ...  ...                 ...   ...       ...
464817     194056  terminated   Bachelor  ... -103 days +05:19:00     0         0
466349     193858  terminated   Bachelor  ... -103 days +20:23:00     0         0
467668     194060  terminated   Bachelor  ... -153 days +09:59:00     0         0
468590     194018  terminated   Bachelor  ... -154 days +23:07:00     0         0
469805     175440  terminated   Bachelor  ... -154 days +13:38:00     0         0

This is the code I am using

result = result.groupby('col1').agg({'col2': ['mean', 'min', 'max']})

print("\nMean, min, and max values of student line logons - indifinite")
print(result)

print("\nresult.columns")
print(result.columns)

This is the result I get

Mean, min, and max values of student line logons
               time_diff                            
                    mean           min           max
Col1                                          
Bachelor   -8.089769e+12 -1.872721e+13 -3.622951e+10
Diploming  -8.718830e+12 -1.586661e+13 -1.426230e+10

How can I get a meaningful result from the 'Mean, min, and max values of student line logons', potentially presenting result like col2 in 'Data (col2)' above? Any wise input from out there is appreciated...;o) Thank you.

1 Answers1

0

Use Series.dt.total_seconds for convert timedeltas to seconds, aggregate and if necessary convert seconds to timedeltas back:

result['Col2'] = result['Col2'].dt.total_seconds()
result = (result.groupby('Col1')['Col2'].agg(['mean', 'min', 'max'])
                .apply(pd.to_timedelta, unit='s'))
print (result)
                         mean                 min                 max
Col1                                                                 
Bachelor  -133 days +04:53:12 -154 days +13:38:00 -103 days +20:23:00
Diploming  -94 days +07:32:12 -142 days +15:10:00  -62 days +00:31:00

EDIT: For ploting is best use:

result['Col2'] = result['Col2'].dt.total_seconds()
result1 = result.groupby('Col1')['Col2'].agg(['mean', 'min', 'max'])

If want avoid error with replace missing values by 0 use:

result['Col2'] = result['Col2'].dt.total_seconds()
result2 = (result.groupby('Col1')['Col2'].agg(['mean', 'min', 'max'])
                .apply(pd.to_timedelta, unit='s')
                .fillna(pd.Timedelta(0))

                
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi jezrael, can you assist me by showing how to visualize the solution, i.e. I get an error when attempting to applying matplotlib. It says as follows: TypeError: Passing integers to fillna for timedelta64[ns] dtype is no longer supported. To obtain the old behavior, pass `pd.Timedelta(seconds=n)` instead. The code used to arrive at this error message is: import matplotlib.pyplot as plt plt.style.use('ggplot') ax = result.plot(kind='bar', title ="Title",figsize=(15,10),legend=True, fontsize=20) ax.set_xlabel("Line",fontsize=15) ax.set_ylabel("Number of logons",fontsize=15) plt.show() – Hubsandspokes Dec 01 '20 at 16:01
  • @Hubsandspokes I think the best is use my solution without convert back to timeltas, so use only `result.groupby('Col1')['Col2'].agg(['mean', 'min', 'max'])`. Your error should be solved by `.fillna(pd.Timedelta(0))` – jezrael Dec 01 '20 at 16:53
  • Hi jezrael, may I ask you to be a little more specific? - First, following your initial solution everything worked fine, and I got the days, etc response per Col1 and Col2. Second, I wanted to 'wind back' to be able to visualize, but I got errors from following your suggestion. Thus, I am a bit confused reading your line '...Your error should be solved by .fillna(pd.Timedelta(0))...'. Is there a chance you could deliver the full code, re your first answer? - Thx, very much, in advance, BR Hubsandspokes – Hubsandspokes Dec 02 '20 at 11:40
  • @Hubsandspokes - Answer was edited, I hope now it is more clear. – jezrael Dec 02 '20 at 11:45
  • Trying this time to convert numpy code below to same days, hours, mins, secs as the first time...X = number_of_logons_all.values split = round(len(X) / 2) X1, X2 = X[0:split], X[split:] mean1, mean2 = X1.mean(), X2.mean() var1, var2 = X1.var(), X2.var() print('mean1=%f, mean2=%f' % (mean1, mean2)) print('variance1=%f, variance2=%f' % (var1, var2)) I got the this: mean1=60785.792548, mean2=61291.266868 variance1=7483553053.651829, variance2=7603208729.348722. Hoped to get same result as last time. I put np.array in a .to_numpy dataframe. hmmm, still get 60785.7925 and so on – Hubsandspokes Dec 08 '20 at 07:59
  • @Hubsandspokes - Is possible create new question? Unfortunately not understand your problem :( – jezrael Dec 08 '20 at 08:01
  • Pls, look at this, new attempt to address my issue...;o) - https://stackoverflow.com/questions/65197015/how-do-i-convert-numpy-array-to-days-hours-mins – Hubsandspokes Dec 08 '20 at 10:18