I have a data set with the growth in student enrollments by college from one year to the next broken down by age bands (18-19, 20-24, etc.). I have another data set with the growth in student enrollments for the same colleges from one year to the next broken down by gender (M, F, O). Unfortunately, we don't have access to the raw data so I don't know the relationship between these (e.g. how many males 18-19, females 20-24, etc.).

Is there a way to do a correlation analysis on these separate datasets against each other to imply some relationships? E.g. I'm trying to see if I can reach any conclusions like "the growth in the 20-24 age band was more strongly correlated to the growth in female vs. male students"?

I have the two datasets loaded in dataframes and have already prepared some basic plots showing trends etc. I did manage to brute-force an age-by-gender view in Excel but wanted to hear others' ideas on the above before I attempt to replicate it in Python...

NikG
  • Appreciate the feedback... ViggoTW pretty much summarized what I'm trying to do below. – NikG Dec 04 '22 at 05:18

1 Answer

It would help to see an example of what your two datasets look like. However, I will go out on a limb and assume that they look something like this:

> df_growth_age.head()
    growth  age_group   college
0   0.941251    19-35   E
1   0.787922    19-35   D
2   0.677788    36-50   C
3   0.088465    36-50   A
4   0.453523    19-35   D

> df_growth_gender.head()
    growth  gender  college
0   0.352022    Male    E
1   0.560317    Other   D
2   0.181704    Female  E
3   0.278119    Female  D
4   0.029306    Other   B

If my assumption about your datasets is roughly correct, I would recommend first joining the two datasets into one:

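import pandas as pd

# Pair every age-band row with every gender row of the same college,
# then sort by college / age_group / gender.
df = pd.merge(
    left=df_growth_age,
    right=df_growth_gender,
    on="college",
    suffixes=("_age", "_gender"),
).set_index(["college", "age_group", "gender"]).sort_index().reset_index()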

> df.head()
    college age_group   gender  growth_age  growth_gender
0   A       18-19       Female  0.753650    0.004030
1   A       18-19       Other   0.753650    0.772802
2   A       19-35       Male    0.140001    0.004030
3   A       19-35       Female  0.140001    0.772802
4   C       19-35       Male    0.831882    0.876803
5   C       19-35       Female  0.831882    0.913343

NB! merge() defaults to an inner join, so colleges that appear in only one of the two datasets are dropped. That might not be what you want.
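To check what the inner join leaves out, one option is an outer join with `indicator=True`. The sketch below uses made-up numbers (the college names and growth values are hypothetical, not from the question) in which college "C" only has age data and college "B" only has gender data:

```python
import pandas as pd

# Hypothetical toy data: "C" is missing from the gender set, "B" from the age set.
df_growth_age = pd.DataFrame({
    "college": ["A", "A", "C"],
    "age_group": ["18-19", "20-24", "18-19"],
    "growth": [0.10, 0.20, 0.30],
})
df_growth_gender = pd.DataFrame({
    "college": ["A", "A", "B"],
    "gender": ["Female", "Male", "Female"],
    "growth": [0.05, 0.15, 0.25],
})

# Default (inner) join: only colleges present in both datasets survive.
inner = pd.merge(
    df_growth_age, df_growth_gender, on="college", suffixes=("_age", "_gender")
)

# Outer join keeps everything; the extra "_merge" column says where each row came from.
outer = pd.merge(
    df_growth_age,
    df_growth_gender,
    on="college",
    how="outer",
    suffixes=("_age", "_gender"),
    indicator=True,
)

# Rows that would have been silently dropped by the inner join.
unmatched = outer.loc[outer["_merge"] != "both", ["college", "_merge"]]
print(unmatched)
```

If `unmatched` is empty for the real data, the default inner join loses nothing.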

From here, you can easily start doing correlation calculations and plots.

Example: Calculate correlation for each college:

df.groupby(["college"])[["growth_age", "growth_gender"]].corr().unstack().iloc[:,1]
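To make the chained call above easier to follow, here is a rough step-by-step version on made-up numbers (the `age` and `gender` frames and their values are assumptions for illustration only). It also shows a caveat worth checking on the real data: because the merge pairs every age row with every gender row inside a college, the two growth columns are uncorrelated by construction within a college, so the per-college value comes out as (numerically) zero.

```python
import pandas as pd

# Made-up numbers, only for illustration.
age = pd.DataFrame({
    "college": ["A", "A", "B", "B"],
    "age_group": ["18-19", "20-24", "18-19", "20-24"],
    "growth": [0.10, 0.20, 0.30, 0.50],
})
gender = pd.DataFrame({
    "college": ["A", "A", "B", "B"],
    "gender": ["Female", "Male", "Female", "Male"],
    "growth": [0.05, 0.15, 0.40, 0.20],
})
df = pd.merge(age, gender, on="college", suffixes=("_age", "_gender"))

# Step 1: one 2x2 correlation matrix per college, stacked on top of each other.
step1 = df.groupby("college")[["growth_age", "growth_gender"]].corr()

# Step 2: unstack() moves the inner row level into the columns,
# leaving a single row per college.
step2 = step1.unstack()

# Step 3: select the (growth_age, growth_gender) entry;
# .iloc[:, 1] picks the same column by position.
per_college = step2[("growth_age", "growth_gender")]
print(per_college)
```

Given that, a comparison across colleges (one value per college and category) is probably more informative than a correlation within each college on the merged frame.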

Example: Plot the relationship between age-band growth and gender growth for each age group, gender, and college:

import seaborn as sns

sns.relplot(
    data=df,
    x="growth_age",
    y="growth_gender",
    hue="college",
    row="age_group",
    col="gender",
    s=100,  # marker size, passed through to the underlying scatter plot
)

ViggoTW
  • This is helpful, thanks! But does the duplication of values across growth_age and growth_gender cause any issues with the correlation? What does unstack() do? – NikG Dec 04 '22 at 05:15
  • Also, is it possible, rather than showing individual scatter plots, to show a correlation matrix with age bands on one axis and gender on the other? – NikG Dec 04 '22 at 05:16
  • Are you coding in a notebook? If so, I would suggest running my code, then removing the right-most method and re-running it. Then do the same for the next right-most method. This can be a helpful and educational way of seeing what each step does, e.g. the unstack :) – ViggoTW Dec 05 '22 at 07:35
  • It is absolutely possible to create a correlation matrix. The easiest way is to create a dataframe that contains one column for each feature you want to correlate. Then use the built-in `.corr()` method. I can recommend the second-highest-rated answer in [this](https://stackoverflow.com/questions/29432629/plot-correlation-matrix-using-pandas) thread :) – ViggoTW Dec 05 '22 at 07:38
  • PS! If you think that my original reply answers your first post, feel free to accept it as an answer :) – ViggoTW Dec 05 '22 at 07:39
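
The correlation matrix discussed in the comments (age bands on one axis, gender on the other) could be sketched roughly as follows. This is a minimal sketch on made-up numbers, assuming one growth value per college and category; with several years per college, `pivot_table` with an aggregation would be needed instead of `pivot`. Each correlation is computed across colleges, so it only says whether colleges with high growth in an age band also tend to have high growth for a gender, not how many students fall in each age-by-gender cell.

```python
import pandas as pd

# Hypothetical data: four colleges, two age bands, two genders.
colleges = ["A", "B", "C", "D"]
df_growth_age = pd.DataFrame({
    "college": colleges * 2,
    "age_group": ["18-19"] * 4 + ["20-24"] * 4,
    "growth": [0.1, 0.2, 0.3, 0.4, 0.4, 0.1, 0.3, 0.2],
})
df_growth_gender = pd.DataFrame({
    "college": colleges * 2,
    "gender": ["Female"] * 4 + ["Male"] * 4,
    "growth": [0.2, 0.4, 0.6, 0.8, 0.5, 0.3, 0.1, 0.4],
})

# One row per college, one column per age band / gender.
age_wide = df_growth_age.pivot(index="college", columns="age_group", values="growth")
gender_wide = df_growth_gender.pivot(index="college", columns="gender", values="growth")

# Side-by-side frame aligned on college, then the full correlation matrix.
combined = pd.concat([age_wide, gender_wide], axis=1)
corr = combined.corr()

# Keep only the age-band rows and gender columns.
cross = corr.loc[age_wide.columns, gender_wide.columns]
print(cross)
```

The resulting `cross` frame can be passed to something like `sns.heatmap(cross, annot=True, vmin=-1, vmax=1)` for a visual version.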