Categorical variables usage in pandas for ANOVA and regression?

Question

To prepare a little toy example:

import pandas as pd
import numpy as np

high, size = 100, 20
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
                   'age': np.random.randint(0, high, size),
                   'outlook': pd.Categorical(np.tile(['positive', 'neutral', 'negative'], size//3+1)[:size]),
                   'smokes': pd.Categorical(np.tile(['lots', 'little', 'not'], size//3+1)[:size]),
                   'outcome': np.random.randint(0, high, size)
                  })
df['age_range'] = pd.Categorical(pd.cut(df.age, range(0, high+5, size//2), right=False,
                             labels=["{0} - {1}".format(i, i + 9) for i in range(0, high, size//2)]))
np.random.shuffle(df['smokes'])

Which will give you something like:

In [2]: df.head(10)
Out[2]:
   perception  age   outlook  smokes  outcome age_range
0          13   65  positive  little       22   60 - 69
1          95   21   neutral    lots       95   20 - 29
2          61   53  negative     not        4   50 - 59
3          27   98  positive     not       42   90 - 99
4          55   99   neutral  little       93   90 - 99
5          28    5  negative     not        4     0 - 9
6          84   83  positive    lots       18   80 - 89
7          66   22   neutral    lots       35   20 - 29
8          13   22  negative    lots       71   20 - 29
9          58   95  positive     not       77   90 - 99

Goal: figure out likelihood of outcome, given {perception, age, outlook, smokes}.

Secondary goal: figure out how important each column is in determining outcome.

Third goal: prove attributes about distribution (here we have randomly generated, so a random distribution should imply the null hypothesis is true?)

Clearly these are all questions findable with statistical hypothesis testing. What's the right way of answering these questions in pandas?

Tempted to just build out a NN for this in TensorFlow. However, I do want to get p-values and all also. So will likely end up with two approaches, the p-value one seems ripe for pandas/statsmodel/numpy/researchpy. How am I meant do this? — A T, May 23 '19 at 02:16
you've asked an important question but now you're digressing from it. Suggest to forget about building models for now and rather focus on **statistically correct approach** for the `categorical variable treatment`. The question can further be enriched by asking **how to measure the interplay between categorical and continuous variables**. Think about it. — mnm, May 23 '19 at 02:38
This sounds like a good use case for [one versus all classification](https://utkuufuk.com/2018/06/03/one-vs-all-classification/). For your predictors you can use pd.get_dummies or one hot encoder from sklearn. — seanswe, May 23 '19 at 06:00
linear regression from statsmodels will give you p-values for each feature. If you're looking for confidence in the regression prediction take a look at this: https://docs.seldon.io/projects/alibi/en/v0.2.0/methods/TrustScores.html, maybe you can adapt it for regression instead of classification — Dan, May 28 '19 at 16:43

score 5 · Answer 1 · edited May 28 '19 at 16:37

Finding out likelihood of outcome given columns and Feature importance (1 and 2)

Categorical data

As the dataset contains categorical values, we can use the LabelEncoder() to convert the categorical data into numeric data.

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
df['outlook'] = enc.fit_transform(df['outlook'])
df['smokes'] = enc.fit_transform(df['smokes'])

Result

df.head()

   perception  age  outlook  smokes  outcome age_range
0          67   43        2       1       78     0 - 9
1          77   66        1       1       13     0 - 9
2          33   10        0       1        1     0 - 9
3          74   46        2       1       22     0 - 9
4          14   26        1       2       16     0 - 9

Without creating any model, we can make use of the chi-squared test, p-value and correlation matrix to determine the relation.

Correlation matrix

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()

Chi-squared test and p-value

from sklearn.feature_selection import chi2

res = chi2(df.iloc[:, :4], df['outcome'])
features = pd.DataFrame({
    'features': df.columns[:4],
    'chi2': res[0],
    'p-value': res[1]
})

Result

features.head()

     features         chi2        p-value
0  perception  1436.012987  1.022335e-243
1         age  1416.063117  1.221377e-239
2     outlook    61.139303   9.805304e-01
3      smokes    57.147404   9.929925e-01

Randomly generated data, so null hypothesis is true. We can verify this by trying to fit a normal curve to the outcome.

Distribution

import scipy as sp

sns.distplot(df['outcome'], fit=sp.stats.norm, kde=False)
plt.show()

From the plot we can conclude that the data does not fit a normal distribution (as it is randomly generated.)

Note: As the data is all randomly generated, you results can vary, based on the size of the data set.

References

For the categorical data encoding one could also use [`pd.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html). — Stefan Falk, May 29 '19 at 12:52
`get_dummies` only gives me 0 or 1, here I have 3 options. Thanks @skillsmuggler, so I should take it that these [p-value](https://en.wikipedia.org/wiki/P-value#Usage)s indicate that the columns are independent; and that the [χ² test](https://en.wikipedia.org/wiki/Chi-squared_test#Example_chi-squared_test_for_categorical_data) values don't fit the [χ²-distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution), so is unable to reject the [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis)? - Finally your correlation matrix shows a strong diagonal, and is therefore true? — A T, May 29 '19 at 23:39
`get_dummies` does [OneHot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f). It splits a single categorical column into a binary (0/1) column of each factor. — skillsmuggler, May 30 '19 at 06:26
You are right about the `null hypothesis`. The diagonal of the correlation matrix must not be considered. We only consider the `upper or lower triangle` in the correlation matrix. The diagonal elements correspond to the `coefficient of correlation` between the same column and is always `1`. [Reference](https://www.statisticshowto.datasciencecentral.com/correlation-matrix/) — skillsmuggler, May 30 '19 at 06:29

Categorical variables usage in pandas for ANOVA and regression?

1 Answers1