I am trying to calculate the Chi square value in python, using a contingency table. Here is an example.
+--------+------+------+
| | Cat1 | Cat2 |
+--------+------+------+
| Group1 | 80 | 120 |
| Group2 | 420 | 380 |
+--------+------+------+
The expected values are:
+--------+------+------+
| | Cat1 | Cat2 |
+--------+------+------+
| Group1 | 100 | 100 |
| Group2 | 400 | 400 |
+--------+------+------+
If I calculate the Chi square value by hand I get 10. With python however I get 9.506. I use the following code:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import scipy
# Some fake data.
n = 5 # Number of samples.
d = 3 # Dimensionality.
c = 2 # Number of categories.
data = np.random.randint(c, size=(n, d))
data = pd.DataFrame(data, columns=['CAT1', 'CAT2', 'CAT3'])
# Contingency table.
contingency = pd.crosstab(data['CAT1'], data['CAT2'])
contingency.iloc[0][0]=80
contingency.iloc[0][1]=120
contingency.iloc[1][0]=420
contingency.iloc[1][1]=380
# Chi-square test of independence.
chi, p, dof, expected = chi2_contingency(contingency)
It is weird that the function gives me the correct expected values, however the Chi square and p-value are off. What am I doing wrong here?
Thanks
p.s.
I am aware that I create the initial table in pandas is pretty lame, but I am not an expert on how to create these nested tables in pandas.