I am wondering if there is a way to calculate a cumulative p-value for each hour of data in a dataframe. For example, with 24 hours of data there would be 24 p-value measurements, but each one would be computed over all hours up to and including the current hour (so hour 3's p-value uses the data from hours 1 through 3).
I have been able to get the p-value for each individual hour by grouping my data by hour and then applying an aggregation function I wrote that calculates all of the statistics needed for p. However, this approach does not produce a cumulative result, only the p-value for each hour in isolation.
Given a df with columns id, ts (a unix timestamp), ab_group, and result, I ran the following code to compute p-values by hour.
import pandas as pd

df['time'] = pd.to_datetime(df.ts, unit='s').values.astype('<m8[h]')

def calc_p(group):
    # Sizes of the control (ab_group == 0) and treatment (ab_group == 1) arms.
    df_old_len = len(group[group.ab_group == 0])
    df_new_len = len(group[group.ab_group == 1])
    # Conversion rates per arm; the filters must reference `group`, not the outer `df`,
    # and the success column is `result` throughout.
    ctr_old = float(len(group[(group.ab_group == 0) & (group.result == 1)])) / df_old_len
    ctr_new = float(len(group[(group.ab_group == 1) & (group.result == 1)])) / df_new_len
    nobs_old = df_old_len
    nobs_new = df_new_len
    z_score, p_val, null = z_test.z_test(ctr_old, ctr_new, nobs_old, nobs_new, effect_size=0.001)
    return p_val

# apply (not agg) so calc_p receives each hour's rows as a whole DataFrame
grouped = df.groupby(by='time').apply(calc_p)
N.B. z_test is my own module containing an implementation of a z-test.
Any advice on how to modify this for a cumulative p-value is much appreciated.
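To make the desired behavior concrete, here is a rough sketch of the kind of brute-force loop I have in mind, reusing the calc_p above (the loop is only my own illustration of "cumulative", not code I am committed to); I am hoping there is a more idiomatic pandas approach than re-slicing the frame once per hour:

# Brute-force illustration: apply calc_p to a growing window, one window per hour.
cumulative_p = {}
for hour in sorted(df['time'].unique()):
    window = df[df['time'] <= hour]      # all rows up to and including this hour
    cumulative_p[hour] = calc_p(window)  # p-value over the cumulative window

cumulative_p = pd.Series(cumulative_p)   # indexed by hour, one cumulative p per hour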