I've researched previous similar questions, but couldn't find any applicable leads:
I have a dataframe, called "df" which is roughly structured as follows:
Income Income_Quantile Score_1 Score_2 Score_3
0 100000 5 75 75 100
1 97500 5 80 76 94
2 80000 5 79 99 83
3 79000 5 88 78 91
4 70000 4 55 77 80
5 66348 4 65 63 57
6 67931 4 60 65 57
7 69232 4 65 59 62
8 67948 4 64 64 60
9 50000 3 66 50 60
10 49593 3 58 51 50
11 49588 3 58 54 50
12 48995 3 59 59 60
13 35000 2 61 50 53
14 30000 2 66 35 77
15 12000 1 22 60 30
16 10000 1 15 45 12
Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):
dfs = []
for level in df.Income_Quantile.unique():
df_temp = df.loc[df.Income_Quantile == level]
dfs.append(df_temp)
Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):
def create_list_of_scores(df):
df_result = pd.DataFrame(columns=cols)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
return df_result
The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "ttest_ind", can be accessed from scipy.stats as follows:
- from scipy.stats import ttest_ind
- from scipy.stats import spearmanr
I tested the function on one subset of the dataframe:
data = dfs[1]
result = create_list_of_scores(data)
It works as expected.
However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:
result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)
I get the output as the columns "Score_1, Score_2, and Score_3" x 5.
I would like to:
- Have just three columns "Score_1, Score_2, and Score_3".
- Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.
Here is what I have in mind:
Score_1 Score_2 Score_3
t-statistic 1
2
3
4
5
p-value 1
2
3
4
5
correlation 1
2
3
4
5
Any idea on how I can merge the output of my function as requested?