correlation analysis in a splitted dataframe (to each split individually) and compare the results

Question

i have data for labor production of a building site, and data of lots of factors that possibly affects this labor productivity. but all of the data is in a single dataframe, so i need to split it according to each stage of the building site (i can't do it manually because it's like 100+ stages, and i want to automate it). splitting the DF i can make a correlation analysis for each stage individualy. my main goal is to compare the influence of this factors in each stage, side-by-side. checking this image can make it more undestandable.

TKS!

Does this answer your question? [Pandas Correlation Groupby](https://stackoverflow.com/questions/28988627/pandas-correlation-groupby) — Leo, Aug 18 '22 at 23:09

score 0 · Accepted Answer · answered Aug 01 '22 at 21:28

It sounds like you have a single DataFrame with the following columns: stage, productivity, factor A, factor B, ..., factor X (your input) and you want a correlation matrix of the productivity and factors by stage, in particular to look at the correlation between each factor and productivity at each stage (your output). If I'm understanding correctly, then you could go about creating the tables in your image like this (say your data is in DataFrame df, here I've used random values):

Correlation matrix by stage:

corr_mat_by_stage = df.groupby( "stage" ).corr()

With stages A and B and factors a-c this would look like this (the middle table in your image):

>>> corr_mat_by_stage.round( 2 )
                    factor a  factor b  factor c  productivity
stage                                                         
A     factor a          1.00     -0.28      0.81         -0.78
      factor b         -0.28      1.00      0.07          0.12
      factor c          0.81      0.07      1.00         -0.39
      productivity     -0.78      0.12     -0.39          1.00
B     factor a          1.00      0.45      0.59          0.52
      factor b          0.45      1.00      0.82          0.80
      factor c          0.59      0.82      1.00          0.75
      productivity      0.52      0.80      0.75          1.00

Then, to get the more concise table that just has the productivity row/column for each stage, you can filter for that column and pivot to reshape so that you have the stages as columns like in your third table in the image:

prod_factor_corr_by_stage = corr_mat_by_stage["productivity"].reset_index().pivot( index="level_1", columns="stage", values="productivity" )

This would look like this (the right-most table in your image):

>>> prod_factor_corr_by_stage.round( 2 )
stage            A     B
level_1                 
factor a     -0.78  0.52
factor b      0.12  0.80
factor c     -0.39  0.75
productivity  1.00  1.00

Leo, that was exactly what I was expecting, THANK YOU SO MUCH! — Lucas Franz, Aug 07 '22 at 22:32

correlation analysis in a splitted dataframe (to each split individually) and compare the results

1 Answers1