
I have a large dataset that looks like the sample below. I'd like to know whether there is a statistically significant difference in "percent change" between rows where the event occurs and rows where it does not. The assumption here is that the higher the percent change, the more meaningful/better.

In another dataset the "event occurs" column is "True, False, Neutral". (Please ignore the index as that is the default pandas index.)

   index    event occurs            percent change
    148       False                  11.27
    149        True                  14.56
    150       False                  10.35
    151       False                   6.07
    152       False                  21.14
    153       False                   7.26
    154       False                   7.07
    155       False                   5.37
    156        True                   2.75
    157       False                   7.12
    158       False                   7.24

What's the best way of determining the significance when it's "True/False" or when it's "True/False/Neutral"?

AnonPyDev
  • What have you tried? :) – DarkDrassher34 Nov 08 '19 at 16:58
  • Obv, nothing that works (yet)! :) – AnonPyDev Nov 08 '19 at 17:03
  • Let's split `event_occurs` into those `False` vs. those `True`. Find the average `percent_change` for both, then run a Shapiro-Francia test to see if the data is normal. If it is, test whether the difference in means is statistically significant. If it is not normal, get back to me. – DarkDrassher34 Nov 08 '19 at 17:08
  • How is an event occurring `Neutral`? – DarkDrassher34 Nov 08 '19 at 17:10
  • If the data for each group is not normal, just use distribution-free tests. Not as strong, but will do. – DarkDrassher34 Nov 08 '19 at 17:11
  • Instead of pasting an entire new set of data with three values, I just added the Neutral value which is logically equivalent. Found the solution here: https://stackoverflow.com/questions/13404468/t-test-in-pandas – AnonPyDev Nov 08 '19 at 17:14
  • Make sure you validate the assumptions of Student's t-test. It does not work for all kinds of data. Hence why I recommend running a normality test for the groups in question. – DarkDrassher34 Nov 08 '19 at 17:17
  • @DarkDrassher34 - FYI, it looks like some datasets are too large for shapiro, so I used `from scipy.stats import normaltest` instead. Here's the warning from shapiro: /usr/local/lib/python3.6/dist-packages/scipy/stats/morestats.py:1309: UserWarning: p-value may not be accurate for N > 5000. warnings.warn("p-value may not be accurate for N > 5000.") – AnonPyDev Nov 08 '19 at 17:40
  • OK, @DarkDrassher34 . The dataset does not appear normal: NormaltestResult(statistic=48.571451210317122, pvalue=2.8368957760644641e-11) – AnonPyDev Nov 08 '19 at 17:43
  • Ok. I want you to know that you would need to run the normal test for each individual group. Try returning `percent_change` where `event_occurs` is each unique value. Then run the test on each `percent_change` corresponding to each individual value of `event_occurs`. – DarkDrassher34 Nov 08 '19 at 17:52
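The per-group procedure suggested in the comments can be sketched as follows. This is a minimal illustration with made-up data; the column names "event occurs" and "percent change" come from the question, while the sample size and alpha = 0.05 are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "event occurs": rng.choice([True, False], size=200),
    "percent change": rng.normal(loc=8.0, scale=3.0, size=200),
})

alpha = 0.05
results = {}
for value, group in df.groupby("event occurs"):
    # Run the normality test on each group's percent-change values separately,
    # as suggested, rather than on the dataset as a whole.
    stat, p = stats.normaltest(group["percent change"])
    results[value] = p > alpha  # True -> no evidence against normality

print(results)
```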

2 Answers


Load Packages, Set Globals, Make Data.

import pandas as pd
import numpy as np
import scipy.stats as stats

n = 60
stat_sig_thresh = 0.05

event_perc = pd.DataFrame({"event occurs": np.random.choice([True, False], n),
                           "percent change": [i * .1 for i in np.random.randint(1, 1000, n)]})

Determine if Distribution is Normal

# run the normality test on each group's percent-change column, not the whole frame
stat_sig = event_perc.groupby("event occurs").apply(lambda x: stats.normaltest(x["percent change"]))
stat_sig = pd.DataFrame(stat_sig)
stat_sig = pd.DataFrame(stat_sig[0].values.tolist(), index=stat_sig.index).reset_index()
stat_sig.loc[(stat_sig.pvalue <= stat_sig_thresh), "Normal"] = False
stat_sig["Normal"].fillna(True, inplace=True)

>>> stat_sig

   event occurs  statistic    pvalue  Normal
0         False   2.917192  0.232563    True
1          True   2.938333  0.230117    True

Determine Statistical Significance

normal = [bool(i) for i in stat_sig.Normal.unique().tolist()]

rvs1 = event_perc["percent change"][event_perc["event occurs"] == True]
rvs2 = event_perc["percent change"][event_perc["event occurs"] == False]

if (len(normal) == 1) & (normal[0] == True):
    print("the distributions are normal")
    if stats.ttest_ind(rvs1, rvs2).pvalue >= stat_sig_thresh:
        # we cannot reject the null hypothesis of identical average scores
        print("we can't say whether there is a statistically significant difference")
    else:
        # we reject the null hypothesis of equal averages
        print("there is a statistically significant difference")

elif (len(normal) == 1) & (normal[0] == False):
    print("the distributions are not normal")
    # wilcoxon is a paired test; for two independent samples use mannwhitneyu
    if stats.mannwhitneyu(rvs1, rvs2, alternative="two-sided").pvalue >= stat_sig_thresh:
        # we cannot reject the null hypothesis of identical average scores
        print("we can't say whether there is a statistically significant difference")
    else:
        # we reject the null hypothesis of equal averages
        print("there is a statistically significant difference")
else:
    # one group looks normal and the other does not
    print("samples are drawn from different distributions")

the distributions are normal
we can't say whether there is a statistically significant difference
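One caveat worth demonstrating: `scipy.stats.wilcoxon` is a paired test and raises an error for unequal-length groups, so for two independent samples the distribution-free analogue is `scipy.stats.mannwhitneyu`. A minimal sketch with deliberately skewed, made-up data (the sample sizes and exponential distributions are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two independent, clearly non-normal (exponential) samples with different scales.
a = rng.exponential(scale=5.0, size=300)
b = rng.exponential(scale=9.0, size=300)

# Mann-Whitney U does not assume normality or equal sample sizes.
res = stats.mannwhitneyu(a, b, alternative="two-sided")
print(res.pvalue)
```

With a scale difference this large, the test comfortably rejects the null of identical distributions.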
ChrisDanger

Thank you @DarkDrassher34 and @ChrisDanger. I put together this code sample from various sources, originally from Dark's answer, and then revised it after Chris's post. Thoughts?

corr_data = df[['event occurs','percent change']]
cat1 = corr_data[corr_data['event occurs']==True]
cat2 = corr_data[corr_data['event occurs']==False]


#----------------------
# is the sample normal / gaussian
#----------------------
from scipy.stats import shapiro # test for normality in small samples
from scipy.stats import normaltest

# Shapiro-Wilk for small samples; normaltest needs roughly 20+ observations
if len(cat1) <= 20:
    stat1, p1 = shapiro(cat1['percent change'])
else:
    stat1, p1 = normaltest(cat1['percent change'])

if len(cat2) <= 20:
    stat2, p2 = shapiro(cat2['percent change'])
else:
    stat2, p2 = normaltest(cat2['percent change'])


alpha = 0.05 # stat threshold
# both groups are normal
if ((p1 > alpha) and (p2 > alpha)):
    print('Samples looks Gaussian (fail to reject H0)')

    #----------------------
    # if normal / gaussian run these tests
    #----------------------
    from scipy.stats import ttest_ind
    stat, p = ttest_ind(cat1['percent change'], cat2['percent change'])
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p > alpha:
        print('Same distribution (fail to reject H0)')
    else:
        print('Different distribution (reject H0)')


else:
    print('Samples do not look Gaussian (reject H0)')
    #----------------------
    # if not normal / gaussian run these tests
    #----------------------
    from scipy.stats import mannwhitneyu
    # the default 'alternative' was one-sided in older SciPy; make two-sided explicit
    stat, p = mannwhitneyu(cat1['percent change'], cat2['percent change'],
                           alternative='two-sided')
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p > alpha:
        print('Same distribution (fail to reject H0)')
    else:
        print('Different distribution (reject H0)')
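For the True/False/Neutral case from the question, the same two-sample tests generalize to three groups: `scipy.stats.f_oneway` (one-way ANOVA) when every group looks normal, and `scipy.stats.kruskal` (Kruskal-Wallis H-test) as the distribution-free alternative. A sketch with made-up data, reusing the question's column names as an assumption:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "event occurs": rng.choice(["True", "False", "Neutral"], size=300),
    "percent change": rng.normal(8.0, 3.0, size=300),
})

# One array of percent-change values per unique group label.
groups = [g["percent change"].values for _, g in df.groupby("event occurs")]

# Parametric: one-way ANOVA across all three groups at once.
f_stat, f_p = stats.f_oneway(*groups)
# Distribution-free alternative: Kruskal-Wallis H-test.
h_stat, h_p = stats.kruskal(*groups)

print(f_p, h_p)
```

If either omnibus test rejects, pairwise follow-up tests (with a multiple-comparison correction) can identify which groups differ.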
AnonPyDev