Applying a Correlation Function to Multiple Subsets of a Dataframe and Concatenating the Results in one Frame

Question

I have a Pandas dataframe called "df" with the following columns:

   Income  Income_Quantile  Score_1  Score_2  Score_3
0  100000                5       75       75      100
1   70000                4       55       77       80
2   50000                3       66       50       60
3   12000                1       22       60       30
4   35000                2       61       50       53
5   30000                2       66       35       77

I also have a for loop that selects subsets of the dataframe by the "Income_Quantile" variable and then drops that variable from each subset.

Here is the code:

for level in df.Income_Quantile.unique():
    df_s = df.loc[df.Income_Quantile == level].drop(columns='Income_Quantile')

Now I want to calculate the Spearman's rank correlation of the "Income" variable with the "Score_1", "Score_2" and "Score_3" variables in "df_s".

I would also like to concatenate the results in a single frame, with the following structure:

            Income Quantile  Score_1    Score_2     Score_3
correlation         ….         ….          ….          ….
p-value             ….         ….          ….          ….
t-statistic         ….         ….          ….          ….

I think that the approach below, from a previous question I asked, could be helpful:

# "correlations" will be a helper function that calculates the Spearman's rank
# correlation of each variable in a subset with the "Income" variable and
# outputs the p-value and t-statistic of the test for each variable.
result = {key: correlations(val) for key, val in df_s.items()}

But I currently have no idea how to carry out the next steps.

Does anyone have any pointers on how I can get from where I currently am to where I want to be? This happens to be my weakest area in Python and I am stuck.

@davidbilla. I am currently working on it. Still researching how I can get p-values and t-statistics for my correlations. Will update that particular section once the function is complete. — john_mon, Mar 04 '20 at 18:17

score 1 · Accepted Answer · answered Mar 04 '20 at 18:23

Is this what you are expecting?

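import pandas as pd
from scipy.stats import spearmanr, ttest_ind

cols = ['Score_1', 'Score_2', 'Score_3']
df_result = pd.DataFrame(columns=cols)
# ttest_ind returns (statistic, p-value)
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
# spearmanr returns (correlation, p-value); index 0 is the coefficient
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[0] for x in cols]
print(df_result)

Output:

              Score_1   Score_2   Score_3
t-statistic  3.842307  3.842281  3.841594
p-value      0.003253  0.003253  0.003257
correlation  0.550782  0.579771  0.828571

Here df_result['Score_1'] holds the t-statistic, p-value and Spearman correlation for df['Income'] and df['Score_1']. Note that the t-statistic and p-value rows come from a two-sample t-test (ttest_ind) of Income against each score, not from the correlation test itself. Let me know if this helps.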

Yes! I now need to figure out how to link that code block to the for loop that generates the dataframes, and how to concatenate the findings from all the dfs into one frame. — john_mon, Mar 04 '20 at 18:37
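
One possible way to connect the two pieces is sketched below. It is not the answerer's code: the helper names spearman_table and correlations_by_group, the min_rows cutoff, and the example frame are all made up for illustration. The sketch takes the p-value from spearmanr and derives the t-statistic of the correlation test as t = rho * sqrt((n - 2) / (1 - rho^2)), which assumes the usual t-approximation with n - 2 degrees of freedom. Groups with fewer than three rows are skipped because the statistic is undefined there, so the six-row frame in the question (one or two rows per quantile) would be too small to produce any output.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_table(frame, target, cols):
    """Rows: correlation / p-value / t-statistic; one column per entry in cols."""
    n = len(frame)
    stats = {}
    for col in cols:
        rho, p_value = spearmanr(frame[target], frame[col])
        rho = float(rho)
        if np.isnan(rho):
            t_stat = np.nan  # e.g. a constant column
        elif abs(rho) == 1:
            t_stat = np.copysign(np.inf, rho)  # perfect monotonic relationship
        else:
            # t-approximation for the correlation test, n - 2 degrees of freedom
            t_stat = rho * np.sqrt((n - 2) / (1 - rho ** 2))
        stats[col] = [rho, float(p_value), t_stat]
    return pd.DataFrame(stats, index=["correlation", "p-value", "t-statistic"])


def correlations_by_group(df, group_col, target, cols, min_rows=3):
    """Apply spearman_table to each group and stack the results in one frame."""
    pieces = {}
    for level, df_s in df.groupby(group_col):
        if len(df_s) < min_rows:
            continue  # too few rows for a meaningful test
        pieces[level] = spearman_table(df_s, target, cols)
    # The dict keys become the outer index level (the quantile).
    return pd.concat(pieces, names=[group_col, "statistic"])


# Hypothetical data with four rows per quantile (not the asker's data).
example = pd.DataFrame({
    "Income": [10, 20, 30, 40, 50, 60, 70, 80],
    "Income_Quantile": [1, 1, 1, 1, 2, 2, 2, 2],
    "Score_1": [1, 3, 2, 4, 8, 6, 7, 5],
    "Score_2": [3, 4, 1, 2, 5, 7, 6, 8],
    "Score_3": [2, 1, 4, 3, 6, 5, 8, 7],
})

result = correlations_by_group(
    example, "Income_Quantile", "Income", ["Score_1", "Score_2", "Score_3"]
)
print(result)
```

The result has a two-level row index (quantile, statistic), so result.loc[2] gives the three-row table for quantile 2. If every group is smaller than min_rows, pd.concat has nothing to combine and raises a ValueError.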
