List of Pandas Dataframes: Merging Function Outputs

Question

I've researched previous similar questions, but couldn't find any applicable leads:

I have a dataframe, called "df" which is roughly structured as follows:

    Income  Income_Quantile Score_1 Score_2 Score_3
0   100000  5                75        75    100
1   97500   5                80        76    94
2   80000   5                79        99    83
3   79000   5                88        78    91
4   70000   4                55        77    80
5   66348   4                65        63    57
6   67931   4                60        65    57
7   69232   4                65        59    62
8   67948   4                64        64    60
9   50000   3                66        50    60
10  49593   3                58        51    50
11  49588   3                58        54    50
12  48995   3                59        59    60
13  35000   2                61        50    53
14  30000   2                66        35    77
15  12000   1                22        60    30
16  10000   1                15        45    12

Using the "Income_Quantile" column and the following "for-loop", I divided the dataframe into a list of 5 subset dataframes (which each contain observations from the same income quantile):

dfs = []

for level in df.Income_Quantile.unique():
    df_temp = df.loc[df.Income_Quantile == level]
    dfs.append(df_temp)

Now, I would like to apply the following function for calculating the spearman correlation, p-value and t-statistic to the dataframe (fyi: scipy.stats functions are used in the main function):

def create_list_of_scores(df):

    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]

    return df_result

The functions that "create_list_of_scores" uses, i.e. "ttest_ind" and "ttest_ind", can be accessed from scipy.stats as follows:

from scipy.stats import ttest_ind
from scipy.stats import spearmanr

I tested the function on one subset of the dataframe:

data = dfs[1]
result = create_list_of_scores(data)

It works as expected.

However, when it comes to applying the function to the entire list of dataframes, "dfs", a lot of issues arise. If I apply it to the list of dataframes as follows:

result = pd.concat([create_list_of_scores(d) for d in dfs], axis=1)

I get the output as the columns "Score_1, Score_2, and Score_3" x 5.

I would like to:

Have just three columns "Score_1, Score_2, and Score_3".
Index the output using the t-statistic, p-value and correlations as the first level index, and; the "Income_Quantile" as the second level index.

Here is what I have in mind:

                  Score_1  Score_2  Score_3
t-statistic 1           
            2           
            3           
            4           
            5           
p-value     1           
            2           
            3           
            4           
            5           
correlation 1           
            2           
            3           
            4           
            5

Any idea on how I can merge the output of my function as requested?

Can you add functions `ttest_ind` and `spearmanr` to question? — jezrael, Mar 05 '20 at 06:14
They are scipy functions: from scipy.stats import ttest_ind from scipy.stats import spearmanr — john_mon, Mar 05 '20 at 06:23
Thank you, is order of first level in `MultiIndex` important? Or is possible use like my answer now? — jezrael, Mar 05 '20 at 06:34

jezrael · Accepted Answer · 2020-03-05T06:31:18.540

I think better is use GroupBy.apply:

cols = ['Score_1','Score_2','Score_3']
def create_list_of_scores(df):

    df_result = pd.DataFrame(columns=cols)
    df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
    df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
    df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[1] for x in cols]
    return df_result

df = df.groupby('Income_Quantile').apply(create_list_of_scores).swaplevel(0,1).sort_index()
print (df)
                                  Score_1       Score_2       Score_3
            Income_Quantile                                          
correlation 1                         NaN           NaN           NaN
            2                         NaN           NaN           NaN
            3                6.837722e-01  0.000000e+00  1.000000e+00
            4                4.337662e-01  6.238377e-01  4.818230e-03
            5                2.000000e-01  2.000000e-01  2.000000e-01
p-value     1                8.190692e-03  8.241377e-03  8.194933e-03
            2                5.887943e-03  5.880440e-03  5.888611e-03
            3                3.606128e-13  3.603267e-13  3.604996e-13
            4                5.584822e-14  5.587619e-14  5.586583e-14
            5                3.861801e-06  3.862192e-06  3.864736e-06
t-statistic 1                1.098143e+01  1.094719e+01  1.097856e+01
            2                1.297459e+01  1.298294e+01  1.297385e+01
            3                2.391611e+02  2.391927e+02  2.391736e+02
            4                1.090548e+02  1.090479e+02  1.090505e+02
            5                1.594605e+01  1.594577e+01  1.594399e+01

List of Pandas Dataframes: Merging Function Outputs

1 Answers1