
I have a dataset with more than 100 columns, and I want to find out whether the data in each column is normally distributed. If it is not, I have to make it normally distributed. Is there a way to check this programmatically? Checking every column manually is tiresome and confusing. I tried this, but my logic is failing:

import numpy as np
from scipy.stats import zscore

def find_normal_dist(_colname):
    # x is the DataFrame holding the dataset (defined elsewhere)
    z = np.abs(zscore(x[_colname]))  # compute the z-scores once
    n = len(z)

    # Percentage of values in each band of standard deviations from the mean.
    # For normal data these are roughly 68, 27 and 4.
    one_std_dev = (z <= 1).sum() / n * 100
    two_std_dev = ((z > 1) & (z <= 2)).sum() / n * 100
    three_std_dev = ((z > 2) & (z <= 3)).sum() / n * 100
    return '1 {} 2 {} 3 {}'.format(round(one_std_dev), round(two_std_dev), round(three_std_dev))

I am using a Kaggle dataset.

E_net4
Lijin Durairaj

1 Answer


Inspect a Q-Q plot, or run the Shapiro-Wilk test, to check whether the data are normal.
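As a rough sketch of how both checks could be run over many columns at once, something like the following might work. The helper name normality_report, the alpha threshold of 0.05, and the synthetic demo data are assumptions made for illustration. They are not from the question. scipy.stats.shapiro and scipy.stats.probplot are the SciPy routines for the Shapiro-Wilk test and the Q-Q fit.

```python
import numpy as np
from scipy import stats


def normality_report(columns, alpha=0.05):
    """Run Shapiro-Wilk and a Q-Q fit on every column (hypothetical helper).

    `columns` is anything with .items() yielding (name, values) pairs,
    e.g. a dict of arrays or a pandas DataFrame of numeric columns.
    """
    report = {}
    for name, values in columns.items():
        data = np.asarray(values, dtype=float)
        data = data[~np.isnan(data)]  # drop missing values before testing
        # Shapiro-Wilk: the null hypothesis is that the data are normal,
        # so a small p-value is evidence against normality.
        w_stat, p_value = stats.shapiro(data)
        # probplot returns the Q-Q points plus a least-squares fit; r close
        # to 1 means the points lie near a straight line.
        (_, _), (_, _, r) = stats.probplot(data, dist="norm")
        report[name] = {
            "W": float(w_stat),
            "p_value": float(p_value),
            "qq_r": float(r),
            "rejects_normality": bool(p_value < alpha),
        }
    return report


# Synthetic stand-in for the real dataset (assumption, for demonstration only).
rng = np.random.default_rng(0)
demo = {
    "gaussian": rng.normal(loc=10.0, scale=2.0, size=500),
    "skewed": rng.exponential(scale=2.0, size=500),
}
for name, row in normality_report(demo).items():
    print(name, row)
```

With a real DataFrame, passing only the numeric columns (for example df.select_dtypes("number")) should work the same way. Two caveats: SciPy notes that the Shapiro-Wilk p-value may not be accurate for samples larger than 5000, and on large samples the test tends to reject even for small, practically harmless deviations, so the Q-Q correlation or the plot itself is worth checking alongside it.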

Alex Reynolds