I am wondering whether there is a way to calculate a cumulative p-value for each hour of data in a DataFrame. For example, with 24 hours of data there would be 24 p-values, each computed from all of the data up to and including the current hour.

I have been able to get the p_value for each hour by grouping my data by hour and then applying an agg_func that I wrote to calculate all of the relevant statistics necessary to calculate p. However, this approach does not produce a cumulative result, only the p for each individual hour.

Given a df with columns id, ts (a Unix timestamp), ab_group, and result, I ran the following code to compute a p-value for each hour.

import pandas as pd
import z_test  # my own module, see the N.B. below

# Truncate each Unix timestamp to the start of its hour
df['time'] = pd.to_datetime(df.ts, unit='s').dt.floor('h')

def calc_p(group):
    # Number of observations in each A/B group for this hour
    nobs_old = len(group[group.ab_group == 0])
    nobs_new = len(group[group.ab_group == 1])
    # Click-through rates; filter on the group's own rows, not on the full df
    ctr_old = float(len(group[(group.ab_group == 0) & (group.result == 1)])) / nobs_old
    ctr_new = float(len(group[(group.ab_group == 1) & (group.result == 1)])) / nobs_new
    z_score, p_val, null = z_test.z_test(ctr_old, ctr_new, nobs_old, nobs_new, effect_size=0.001)
    return p_val

# apply passes each hourly group to calc_p as a DataFrame
grouped = df.groupby(by='time').apply(calc_p)

N.B. z_test is my own module containing an implementation of a z_test.

Any advice on how to modify this for a cumulative p is much appreciated.

Grr
  • http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Paul H Sep 16 '16 at 17:04
  • I don't think the p value itself, or the components of its calculation, are easily transformed into something additive. – Ami Tavory Sep 16 '16 at 17:04
  • @AmiTavory I came up with a solution. Ended up having to set each component to a global variable and updating within the function. – Grr Sep 16 '16 at 17:54

1 Answer

So I came up with a workaround on my own for this one.

I modified calc_p() to keep its running totals in global variables. Each call adds the current hour's counts to the totals from all earlier hours and then computes p from the cumulative values. Below is the edited code:

def calc_p(group):
    global df_old_len, df_new_len, clicks_old, clicks_new
    clicks_old += len(group[(group.landing_page == 'old_page') & (group.converted == 1)])
    clicks_new += len(group[(group.landing_page == 'new_page') & (group.converted == 1)])
    df_old_len += len(group[group.landing_page == 'old_page'])
    df_new_len += len(group[group.landing_page == 'new_page'])
    ctr_old = float(clicks_old)/df_old_len
    ctr_new = float(clicks_new)/df_new_len
    z_score, p_val, null = z_test.z_test(ctr_old, ctr_new, df_old_len, df_new_len, effect_size=0.001)
    return p_val

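# Initialize the running totals to 0 before the groupby calls calc_p
df_old_len = 0
df_new_len = 0
clicks_old = 0
clicks_new = 0

# groupby sorts the hourly keys by default, so calc_p sees the hours in chronological order
grouped = df.groupby(by='time').apply(calc_p)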
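
The same cumulative result can probably be had without global state, because the counts that feed the test (observations and clicks per group) are additive across hours even though p itself is not. The sketch below sums those counts per hour, takes running totals with cumsum, and computes one p-value per row. It uses the question's column names (ab_group, result) and assumes the time column already exists. The names two_proportion_p_value and cumulative_p_values are hypothetical, and the pooled two-proportion z-test is only a stand-in for z_test.z_test, whose implementation and effect_size handling are not shown here.

```python
import math

import pandas as pd


def two_proportion_p_value(clicks_old, n_old, clicks_new, n_new):
    """Two-sided p-value of a pooled two-proportion z-test.

    Assumption: this is a stand-in for the author's z_test.z_test and
    ignores its effect_size argument.
    """
    if n_old == 0 or n_new == 0:
        return float("nan")
    pooled = (clicks_old + clicks_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return float("nan")
    z = (clicks_new / n_new - clicks_old / n_old) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    return math.erfc(abs(z) / math.sqrt(2))


def cumulative_p_values(df):
    """Return one p-value per hour, based on all rows up to and including that hour."""
    old = df.ab_group == 0
    new = df.ab_group == 1
    hit = df.result == 1
    # Per-hour counts: these are additive, so they can be accumulated
    hourly = (
        df.assign(
            n_old=old.astype(int),
            n_new=new.astype(int),
            clicks_old=(old & hit).astype(int),
            clicks_new=(new & hit).astype(int),
        )
        .groupby("time")[["n_old", "n_new", "clicks_old", "clicks_new"]]
        .sum()
        .sort_index()
    )
    # Running totals over the chronologically sorted hours
    running = hourly.cumsum()
    return running.apply(
        lambda r: two_proportion_p_value(r.clicks_old, r.n_old, r.clicks_new, r.n_new),
        axis=1,
    )
```

Calling cumulative_p_values(df) returns a Series indexed by hour. Its last entry should match the test run once on the whole dataset, which makes for a quick sanity check.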
Grr