
Hi, I am running the following model with statsmodels and it works fine.

from statsmodels.formula.api import ols
from statsmodels.iolib.summary2 import summary_col  # for summary stats of large tables
time_FE_str = ' + C(hour_of_day) + C(day_of_week) + C(week_of_year)'
weather_2_str = ' + C(weather_index) + rain + extreme_temperature + wind_speed'
model = ols("activity_count ~ C(city_id)" + weather_2_str + time_FE_str, data=df)
results = model.fit()
print(summary_col(results).tables)

print('F-TEST:')
hypotheses = '(C(weather_index) = 0), (rain=0), (extreme_temperature=0), (wind_speed=0)'
f_test = results.f_test(hypotheses)

However, I do not know how to formulate the hypothesis for the F-test if I want to include the categorical variable C(weather_index). I tried every version I could imagine, but I always get an error.

Did someone face this issue before?

Any ideas?

F-TEST:
Traceback (most recent call last):
  File "C:/VK/scripts_python/predict_activity.py", line 95, in <module>
    f_test = results.f_test(hypotheses)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\statsmodels\base\model.py", line 1375, in f_test
    invcov=invcov, use_f=True)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\statsmodels\base\model.py", line 1437, in wald_test
    LC = DesignInfo(names).linear_constraint(r_matrix)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\design_info.py", line 536, in linear_constraint
    return linear_constraint(constraint_likes, self.column_names)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 391, in linear_constraint
    tree = parse_constraint(code, variable_names)
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 225, in parse_constraint
    return infix_parse(_tokenize_constraint(string, variable_names),
  File "C:\Users\Niko\Anaconda2\envs\gl-env\lib\site-packages\patsy\constraint.py", line 184, in _tokenize_constraint
    Origin(string, offset, offset + 1))
patsy.PatsyError: unrecognized token in constraint
   (C(weather_index) = 0), (rain=0), (extreme_temperature=0), (wind_speed=0)
    ^
Niko
  • What is that hypothesis intended to mean? For instance, does `C(weather_index)=0` mean that this variable has no effect? Does `rain=0` mean that the variable `rain` is to be constrained to a value of zero. – Bill Bell Jan 04 '18 at 16:40

2 Answers


The methods t_test, wald_test and f_test are for hypothesis tests on the parameters directly, not for an entire categorical or composite effect.

results.summary() shows the parameter names that patsy created for the categorical variables. Those can be used to create contrasts or restrictions for the categorical effects.
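A minimal sketch of that approach, using toy data (the names grp and x here are illustrative, not from the question): pull the dummy-column names patsy created from the fitted model, then join them into one string constraint for f_test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: one categorical regressor and one numeric regressor.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y": rng.normal(size=60),
    "grp": np.repeat(["a", "b", "c"], 20),
    "x": rng.normal(size=60),
})
results = smf.ols("y ~ C(grp) + x", data=df).fit()

# The names patsy created for the dummy columns,
# e.g. 'C(grp)[T.b]' and 'C(grp)[T.c]'.
dummy_names = [n for n in results.model.exog_names
               if n.startswith("C(grp)")]

# Joint null hypothesis: every dummy coefficient is zero.
hypothesis = ", ".join(f"{name} = 0" for name in dummy_names)
f_res = results.f_test(hypothesis)
print(f_res)
```

The key point is that the constraint string must use the exact column names from `results.model.exog_names` (or the summary), not the raw formula term `C(grp)`.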

As an alternative, anova_lm directly computes the hypothesis test that a term, e.g. a categorical variable, has no effect.
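For instance, with the same kind of toy model as above (hypothetical data, not the question's), anova_lm produces one F-test per formula term, so the row for the categorical variable is the joint test that its whole effect is zero:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Toy data with a categorical and a numeric regressor.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y": rng.normal(size=60),
    "grp": np.repeat(["a", "b", "c"], 20),
    "x": rng.normal(size=60),
})
results = smf.ols("y ~ C(grp) + x", data=df).fit()

# Type-II ANOVA table: one F-test per term; the C(grp) row is the
# joint test that all of its dummy coefficients are zero.
table = anova_lm(results, typ=2)
print(table)
```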

Josef
  • Thanks for your comment! So you are saying that the joint significance of a set of variables (for example a categorical variable, which is a set of binary dummies) cannot be reliably tested with f_test()? That would make sense, as in my understanding this normally requires "running two regressions", which does not seem to happen for a command called on the results, not the model. How would you run anova_lm() directly? The documentation is unfortunately really cryptic. – Niko Jan 08 '18 at 16:25
  • second half of this notebook has anova_lm examples http://www.statsmodels.org/devel/examples/notebooks/generated/interactions_anova.html – Josef Jan 08 '18 at 18:32
  • "cannot be reliably tested with a f_test()" No, f_test or the more general wald_test are designed for joint hypothesis. However, those tests require that the user specifies each sub-hypothesis, e.g. that the coefficient of each dummy variable is zero. In contrast, anova_lm and wald_test_terms http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.wald_test_terms.html build the joint restrictions internally. – Josef Jan 08 '18 at 18:37
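A short sketch of the wald_test_terms route mentioned in the comment above (toy data and names, not the question's): it builds the joint restriction for each formula term internally, so nothing has to be typed by hand.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with a categorical and a numeric regressor.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "y": rng.normal(size=60),
    "grp": np.repeat(["a", "b", "c"], 20),
    "x": rng.normal(size=60),
})
results = smf.ols("y ~ C(grp) + x", data=df).fit()

# One joint Wald test per formula term; the restrictions for the
# categorical dummies are assembled internally by statsmodels.
wt = results.wald_test_terms()
print(wt)
```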

Leave out the C()!

I tried making an analysis of these data.

Area    Clover_yield    Yarrow_stems
A   19.0    220
A   76.7    20
A   11.4    510
A   25.1    40
A   32.2    120
A   19.5    300
A   89.9    60
A   38.8    10
A   45.3    70
A   39.7    290
B   16.5    460
B   1.8 320
B   82.4    0
B   54.2    80
B   27.4    0
B   25.8    450
B   69.3    30
B   28.7    250
B   52.6    20
B   34.5    100
C   49.7    0
C   23.3    220
C   38.9    160
C   79.4    0
C   53.2    120
C   30.1    150
C   4.0 450
C   20.7    240
C   29.8    250
C   68.5    0

When I used the linear model in the first call to ols shown in the code, I eventually ran into the snag you experienced. However, when I omitted any mention of the fact that Area assumes discrete levels, I was able to calculate F-tests on contrasts.

import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv('clover.csv', sep='\s+')
# First attempt: declaring Area as categorical led to the same snag
model = ols('Clover_yield ~ C(Area) + Yarrow_stems', data=df)
# Second attempt: leaving out C() let the F-test below run
model = ols('Clover_yield ~ Area + Yarrow_stems', data=df)

results = model.fit()
print(results.summary())

print(results.f_test(['Area[T.B] = Area[T.C], Yarrow_stems=150']))

Here is the output.

Notice that the summary indicates the names that can be used in formulating contrasts for factors, in our case Area[T.B] and Area[T.C].

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           Clover_yield   R-squared:                       0.529
Model:                            OLS   Adj. R-squared:                  0.474
Method:                 Least Squares   F-statistic:                     9.726
Date:                Thu, 04 Jan 2018   Prob (F-statistic):           0.000177
Time:                        17:26:03   Log-Likelihood:                -125.61
No. Observations:                  30   AIC:                             259.2
Df Residuals:                      26   BIC:                             264.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       57.5772      6.337      9.086      0.000      44.551      70.603
Area[T.B]        0.3205      7.653      0.042      0.967     -15.411      16.052
Area[T.C]       -0.5432      7.653     -0.071      0.944     -16.274      15.187
Yarrow_stems    -0.1086      0.020     -5.401      0.000      -0.150      -0.067
==============================================================================
Omnibus:                        0.459   Durbin-Watson:                   2.312
Prob(Omnibus):                  0.795   Jarque-Bera (JB):                0.449
Skew:                           0.260   Prob(JB):                        0.799
Kurtosis:                       2.702   Cond. No.                         766.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<F test: F=array([[ 27873807.59795523]]), p=4.939796675253845e-83, df_denom=26, df_num=2>

As mentioned in my comment, I'm not clear what you intended to test.

Edit: Prompted by user333700's comment, I've tried running this code again with two slightly different statements: model = ols('Clover_yield ~ C(Area) + Yarrow_stems', data=df) and print(results.f_test(['C(Area)[T.B] = C(Area)[T.C], Yarrow_stems=150'])). 'C(Area)[T.B]' and 'C(Area)[T.C]' came from the summary of the changed model. Thus, for this type of analysis, it doesn't matter whether you declare with C() or not; you must simply remember to use the appropriate form for the dummy variable names, as shown in the summary.
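The equivalence can be checked directly: with string-valued levels, the C() and non-C() specifications produce the same design matrix, and only the column names differ. A tiny hypothetical example (data invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical miniature version of the clover data.
df = pd.DataFrame({
    "y":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Area": ["A", "A", "B", "B", "C", "C"],
    "x":    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

m1 = smf.ols("y ~ C(Area) + x", data=df)
m2 = smf.ols("y ~ Area + x", data=df)

# Identical design matrices, different column names.
assert np.allclose(m1.exog, m2.exog)
print(m1.exog_names)  # the C(Area)[T.B]-style names
print(m2.exog_names)  # the Area[T.B]-style names
```

Whichever form appears in `exog_names` is the form the constraint strings must use.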

Bill Bell
  • If the values of a variable are strings, then they are automatically treated as categorical in the formulas. C(Area) and Area should produce the same designmatrix and parameter names. – Josef Jan 05 '18 at 01:00
  • @user333700: Thank you, yes, I had noticed that. However, when I put 'C(Area)' in the model formula and then tried to specify a contrast as in the last line of my code the attempt was rejected. – Bill Bell Jan 05 '18 at 17:15
  • Did you check and use the names that patsy created? The hypothesis are **supposed** to work with the provided names, but I don't really know the pattern that patsy uses to create names for categorical effects, I always need to check summary() or the internally stored names or formula design info. – Josef Jan 05 '18 at 22:15
  • @user333700: I just tried it *one more time*! I'll edit my answer. This has been arduous. – Bill Bell Jan 05 '18 at 23:33
  • Just a final comment: We need to use C(...) if the variable is numeric, e.g. integers, but we want to treat them as categorical. For strings C(..) is redundant, except, as seen here, it affects the names that patsy creates. – Josef Jan 06 '18 at 00:46
  • I agree with @user333700: leaving out the C() does not do the trick if the variable is numeric (which was the case in both Bill's and my examples). Maybe the solution is to simply redefine the variable as a string and then leave out the C(). – Niko Jan 08 '18 at 16:29
  • @Niko: I would now say that I should have read [the patsy documentation](https://patsy.readthedocs.io/en/v0.1.0/formulas.html) much more carefully immediately before attempting to answer your question! :) Although the approach I offered works in the situation offered in your question I now believe that it's usually (always?) better to use C() and then consult the summary() to see how patsy has named the dummy variables. But I probably need more experience with more models. – Bill Bell Jan 08 '18 at 16:41
  • Hi Bill, the issue with results.summary() is that if you have a lot of dummies it stops working and does not print the variable names. Also, it seems quite unhandy to type hundreds of variables into the constraint. – Niko Jan 08 '18 at 18:05
  • @Niko If you have a large number of constraints, then it is often easier to directly build the constraint matrix instead of using the constraints defined by strings. – Josef Jan 08 '18 at 18:41
  • @Niko: I haven't been confronted with anything like that, so have not encountered the problem. Since I can't type more than a few characters without making a mistake it wouldn't be long before I'd be scurrying around for an alternative. – Bill Bell Jan 08 '18 at 18:48
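A sketch of the constraint-matrix approach Josef mentions for the many-dummies case (toy data and names, invented for illustration): instead of typing hundreds of constraint strings, build the restriction matrix R for H0: Rb = 0 programmatically and pass it to f_test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with a categorical and a numeric regressor.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "y": rng.normal(size=60),
    "grp": np.repeat(["a", "b", "c"], 20),
    "x": rng.normal(size=60),
})
results = smf.ols("y ~ C(grp) + x", data=df).fit()

# One restriction row per dummy coefficient of C(grp), built from
# the parameter names instead of hand-typed constraint strings.
names = results.model.exog_names
rows = [i for i, n in enumerate(names) if n.startswith("C(grp)")]
R = np.zeros((len(rows), len(names)))
for r, i in enumerate(rows):
    R[r, i] = 1.0

f_res = results.f_test(R)  # joint test that all C(grp) dummies are zero
print(f_res)
```

The same loop scales to any number of dummies, since it keys off the generated column names rather than retyping them.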