The problem is how to find the correlation between two categorical series. I have to find the correlation between HAVING_CPOX and NUM_VECILLA_veccine (the number of vaccine doses given) among children. The main catch is that the HAVING_CPOX column has 4 unique values:

  • 1: having cpox
  • 2: not having cpox
  • 99: may be NULL
  • 7: I don't know

In df['P_NUMVRC'] the unique values are [1, 2, 3, 0, NaN]. These are two distinct series, so how do I put them together and find the correlation? I used value_counts to get the frequency of each value:

    1    13781
    2      213
    3        1
    Name: P_NUMVRC, dtype: int64

For the having_cpox column:

    2     27955
    1       402
    77      105
    99        3
    Name: HAD_CPOX, dtype: int64
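To look at the two columns together rather than one at a time, a joint frequency table may help. This is a minimal sketch, assuming a DataFrame with `HAD_CPOX` and `P_NUMVRC` columns; the toy values are made up for illustration and are not the survey data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), using the same codes as the question
df = pd.DataFrame({
    "HAD_CPOX": [1, 2, 2, 2, 77, 99, 2, 1],
    "P_NUMVRC": [0, 1, 2, 1, 1, np.nan, np.nan, 1],
})

# Drop rows where either column is missing so every row lands in a cell
clean = df.dropna(subset=["HAD_CPOX", "P_NUMVRC"])

# Joint frequency table: rows are HAD_CPOX codes, columns are dose counts
table = pd.crosstab(clean["HAD_CPOX"], clean["P_NUMVRC"])
print(table)
```

Each cell counts the children with that combination of chickenpox code and number of doses, which shows how the codes 77 and 99 are distributed before deciding how to handle them.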

The requirement is as follows:

A positive correlation (e.g., corr > 0) means that an increase in had_chickenpox_column (which means more no's) would also increase the values of num_chickenpox_vaccine_column (which means more doses of vaccine). If there is a negative correlation (e.g., corr < 0), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.
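One possible reading of this requirement is to keep the original coding (1 = yes, 2 = no), treat every other code as missing, and compute the Pearson correlation on what remains. A minimal sketch under those assumptions, with column names taken from the value_counts output and made-up toy values:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the survey data)
df = pd.DataFrame({
    "HAD_CPOX": [1, 1, 2, 2, 2, 2, 77, 99],
    "P_NUMVRC": [0, 1, 1, 2, 2, np.nan, 1, 2],
})

# Keep only the meaningful answers: 1 = had chickenpox, 2 = did not.
# Any other code (e.g. 77, 99) becomes NaN.
had_cpox = df["HAD_CPOX"].where(df["HAD_CPOX"].isin([1, 2]))

# Series.corr uses Pearson correlation by default and skips
# pairs where either value is missing.
corr = had_cpox.corr(df["P_NUMVRC"])
print(corr)
```

With this coding, a positive value means that "no" answers go together with more doses, which matches the sign convention in the requirement.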

Trenton McKinney
Md Sani
  • Does this answer your question? [Use .corr to get the correlation between two columns](https://stackoverflow.com/questions/42579908/use-corr-to-get-the-correlation-between-two-columns) – Trenton McKinney Sep 30 '20 at 05:39
  • No, I tried that already – Md Sani Oct 01 '20 at 10:45
  • Please see [How to provide a reproducible copy of your DataFrame using `df.head(30).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246), then **[edit] your question**, and paste the clipboard into a code block. Always provide a [mre] **with code, data, errors, current output, and expected output, as text**. If relevant, plot images are okay. – Trenton McKinney Oct 01 '20 at 16:23

1 Answer

I think what you are looking for is np.corrcoef. It takes two arrays (one-dimensional in your case) and returns the matrix of Pearson correlation coefficients (for more details see: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html). So basically:

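    import numpy as np
    # keep valid codes (1 = had cpox, 2 = not), drop missing dose counts, and copy to avoid SettingWithCopyWarning
    valid_df = df.query('HAVING_CPOX < 3').dropna(subset=['P_NUMVRC']).copy()
    valid_df['HAVING_CPOX'] = (valid_df['HAVING_CPOX'] == 1).astype(int)  # 1 = had cpox, 0 = did not
    # corrcoef returns a 2x2 matrix; [0, 1] is the correlation between the two columns
    corr = np.corrcoef(valid_df['HAVING_CPOX'], valid_df['P_NUMVRC'])[0, 1]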

What I did was first get rid of the 99's and 7's, since you can't really rely on those. Then I changed HAVING_CPOX to be binary (0 is "has no cpox" and 1 is "has cpox") so that the correlation makes sense. Finally, I used numpy's corrcoef.
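The steps described above can be sketched end to end on a small made-up DataFrame. This is an illustrative sketch only: the values are not the survey data, `HAS_CPOX` is a hypothetical helper column, and the `.copy()`, `dropna`, and `[0, 1]` indexing are assumptions added so the example runs cleanly. It also compares the binary recoding with the original 1 = yes, 2 = no coding, since the two give correlations of opposite sign:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the survey data)
df = pd.DataFrame({
    "HAVING_CPOX": [1, 1, 2, 2, 2, 2, 7, 99],
    "P_NUMVRC": [0, 1, 1, 2, 2, np.nan, 1, 2],
})

# Keep codes 1 and 2 only, drop rows with a missing dose count,
# and copy so that later assignments do not touch a view of df.
valid_df = df.query("HAVING_CPOX < 3").dropna(subset=["P_NUMVRC"]).copy()

# Binary recoding: 1 = has cpox, 0 = has no cpox (hypothetical column name)
valid_df["HAS_CPOX"] = (valid_df["HAVING_CPOX"] == 1).astype(int)

# np.corrcoef returns a 2x2 matrix; [0, 1] is the cross-correlation.
corr_binary = np.corrcoef(valid_df["HAS_CPOX"], valid_df["P_NUMVRC"])[0, 1]
corr_original = np.corrcoef(valid_df["HAVING_CPOX"], valid_df["P_NUMVRC"])[0, 1]
print(corr_binary, corr_original)
```

Because the binary column is a decreasing linear function of the original codes, the two coefficients have the same magnitude and opposite signs, so the coding should be chosen to match the sign convention being asked for.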

yonatansc97
  • `() got an unexpected keyword argument 'inplace'`: this error pops up every time I run the program as you said. If I remove `inplace`, it gives me a boolean Series. How do I use the boolean values in the correlation? Alternatively, if I write the code like – Md Sani Oct 01 '20 at 11:55
  • `df['HAD_C'] = df[df['HAD_C'] == 1].HAD_C`, it gives me a warning that a copy cannot be modified and recommends that I see the documentation – Md Sani Oct 01 '20 at 11:56