How to find the correlation between two categorical variables: had chickenpox and number of vaccine doses given

Question

The problem is how to find the correlation between two categorical [series] items. I have to find the correlation between HAVING_CPOX and the number of varicella vaccine doses given among children. The main catch is that the HAVING_CPOX column has 4 unique values:

1 - had chickenpox
2 - did not have chickenpox
99 - may be NULL
77 - don't know

In df['P_NUMVRC'] the unique values are [1, 2, 3, 0, NaN]. These are two distinct series, so how do I put them together and find the correlation? I used value_counts to get the frequency of each value:

    1    13781
    2      213
    3        1
    Name: P_NUMVRC, dtype: int64

For the HAD_CPOX column:

    2     27955
    1       402
    77      105
    99        3
    Name: HAD_CPOX, dtype: int64

The requirement is as follows:

A positive correlation (e.g., corr > 0) means that an increase in had_chickenpox_column (which means more no's) would also increase the values of num_chickenpox_vaccine_column (which means more doses of vaccine). If there is a negative correlation (e.g., corr < 0), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.
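To make the sign convention in this requirement concrete, here is a minimal sketch on made-up numbers. The `toy` frame and its values are hypothetical and are not the survey data; only the column names and the 1 = yes, 2 = no coding come from the question. It also shows that recoding "yes" as 1 and "no" as 0 flips the sign of the result, which matters when comparing against the requirement.

```python
import pandas as pd

# Made-up toy data, NOT the survey: 1 = had chickenpox, 2 = did not.
toy = pd.DataFrame({
    "HAD_CPOX": [1, 1, 2, 2, 2, 2],
    "P_NUMVRC": [0, 1, 1, 2, 2, 1],
})

# Original coding: a larger HAD_CPOX value means "no".
corr = toy["HAD_CPOX"].corr(toy["P_NUMVRC"])  # Pearson by default

# Recoded so that 1 = had chickenpox and 0 = did not.
had = (toy["HAD_CPOX"] == 1).astype(int)
corr_flipped = had.corr(toy["P_NUMVRC"])

print(corr, corr_flipped)  # same magnitude, opposite sign
```

In this toy data the children without chickenpox received more doses, so the correlation under the original coding is positive, matching the "more no's, more doses" reading above.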

Does this answer your question? [Use .corr to get the correlation between two columns](https://stackoverflow.com/questions/42579908/use-corr-to-get-the-correlation-between-two-columns) — Trenton McKinney, Sep 30 '20 at 05:39
Please see [How to provide a reproducible copy of your DataFrame using `df.head(30).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246), then **[edit] your question**, and paste the clipboard into a code block. Always provide a [mre] **with code, data, errors, current output, and expected output, as text**. If relevant, plot images are okay. — Trenton McKinney, Oct 01 '20 at 16:23

score 0 · Answer 1 · answered Sep 30 '20 at 05:40

I think what you are looking for is np.corrcoef. It takes two (in your case one-dimensional) arrays and returns the matrix of Pearson correlation coefficients; the off-diagonal entry is the correlation between the two arrays (for more details see: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html). So basically:

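    import numpy as np
    # Keep codes 1 and 2, drop missing dose counts; .copy() avoids SettingWithCopyWarning
    valid_df = df.query('HAVING_CPOX < 3').dropna(subset=['P_NUMVRC']).copy()
    # apply() takes no inplace argument, so assign the 0/1 result back
    valid_df['HAVING_CPOX'] = (valid_df['HAVING_CPOX'] == 1).astype(int)
    corr = np.corrcoef(valid_df['HAVING_CPOX'], valid_df['P_NUMVRC'])[0, 1]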

First I got rid of the 99s and 77s, since you can't really rely on those. Then I made HAVING_CPOX binary (0 is "has no cpox" and 1 is "has cpox") so that the correlation makes sense. Finally I used numpy's corrcoef.
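The steps described here can be sketched end to end on a small made-up frame. The rows below are invented for illustration only; the column names follow this answer, and the use of `isin([1, 2])` and `dropna` is an assumption about how to handle the "don't know" codes and the NaN dose counts mentioned in the question.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not the survey data.
df = pd.DataFrame({
    "HAVING_CPOX": [1, 2, 2, 77, 99, 2, 1, 2],
    "P_NUMVRC": [0, 1, 2, 1, np.nan, np.nan, 1, 1],
})

# 1. Keep only the definite answers (1 = yes, 2 = no) and rows with a dose count.
valid_df = df[df["HAVING_CPOX"].isin([1, 2])].dropna(subset=["P_NUMVRC"]).copy()

# 2. Recode to binary: 1 = had chickenpox, 0 = did not.
valid_df["HAVING_CPOX"] = (valid_df["HAVING_CPOX"] == 1).astype(int)

# 3. np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
corr_matrix = np.corrcoef(valid_df["HAVING_CPOX"], valid_df["P_NUMVRC"])
corr = corr_matrix[0, 1]

print(corr_matrix.shape, corr)
```

With pandas already loaded, `valid_df["HAVING_CPOX"].corr(valid_df["P_NUMVRC"])` should give the same number without indexing into a matrix.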

yonatansc97

The line with `inplace` fails every time I run the program as you said, with "() got an unexpected keyword argument 'inplace'". If I remove `inplace`, it gives me a boolean Series. How do I use the boolean values in the correlation? Alternatively, if I write the code like – Md Sani Oct 01 '20 at 11:55
`df['HAD_C'] = df[df['HAD_C'] == 1].HAD_C`, it gives me an error with a caveat that a copy cannot be modified, and recommends that I see the Python documentation – Md Sani Oct 01 '20 at 11:56
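A short sketch may help with the two points raised in these comments. It uses a made-up `HAD_C` column (hypothetical values, already limited to codes 1 and 2) to show what the filtered assignment produces after index alignment, and how a boolean comparison can be cast to 0/1 so it can be fed to a correlation. The warning about modifying a copy most likely comes from `df` itself being a slice of another frame, which is an assumption since the full code is not shown.

```python
import pandas as pd

# Made-up values: 1 = had chickenpox, 2 = did not.
df = pd.DataFrame({"HAD_C": [1, 2, 2, 1, 2]})

# The expression from the comment keeps only the rows equal to 1.
filtered = df[df["HAD_C"] == 1].HAD_C

# Assigning it back to the column aligns on the index, so every row that
# was filtered out becomes NaN instead of 0. reindex shows that result.
aligned = filtered.reindex(df.index)

# A plain comparison gives a boolean Series; astype(int) turns it into 0/1.
had_flag = (df["HAD_C"] == 1).astype(int)

print(aligned.tolist())
print(had_flag.tolist())
```

If `df` was produced by filtering another DataFrame, calling `.copy()` on that filtered result before assigning new columns is the usual way to avoid the copy warning.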
