Dummy Variable in R

Question

Ciao Everyone,

I would like to create a dummy variable in R. I have a list of Italian regions and a variable called mafia, which is coded 1 for regions with high levels of mafia infiltration and 0 for regions with lower levels.

Now I would like to create a dummy that considers only the regions with high levels of mafia infiltration (mafia = 1).

Welcome to StackOverflow. Please take a look at these tips on how to produce a [minimum, complete, and verifiable example](http://stackoverflow.com/help/mcve), as well as this post on [creating a great example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Perhaps the following tips on [asking a good question](http://stackoverflow.com/help/how-to-ask) may also be worth a read. — lmo, Mar 02 '17 at 17:14

dleal · Answer 1 · 2017-03-02T17:36:18.707

If I understand your question correctly, the typical way of adding a dummy variable to a regression (dummies are often used to model fixed effects) is to wrap the variable in the function factor. Here is an example that creates random data and then uses factor in a linear regression:

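set.seed(1)
library(data.table)
# Cycle the three region labels explicitly with rep(); recent data.table
# versions refuse to recycle a length-3 column to 100 rows
A = data.table(region = rep(LETTERS[1:3], length.out = 100), y = runif(100), x = runif(100), mafia = sample(c(0, 1), 100, replace = TRUE))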
> head(A)
   region        var mafia
1:      A 0.67371223     1
2:      B 0.09485786     0
3:      C 0.49259612     1
4:      A 0.46155184     1
5:      B 0.37521653     1
6:      C 0.99109922     1

formula = y ~ x + factor(mafia)

reg <- lm(formula, data = A)

> summary(reg)

Call:
lm(formula = formula, data = A)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46965 -0.24828 -0.03362  0.28780  0.51183 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.46196    0.07093   6.513 3.28e-09 ***
x               0.06735    0.10521   0.640    0.524    
factor(mafia)1 -0.01830    0.06415  -0.285    0.776    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3189 on 97 degrees of freedom
Multiple R-squared:  0.005498,  Adjusted R-squared:  -0.01501 
F-statistic: 0.2681 on 2 and 97 DF,  p-value: 0.7654
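
If the mafia column does not exist yet, one way to build it is from a list of region names. The following is only a minimal sketch: the region labels and the `high_mafia` vector are hypothetical placeholders, not real classifications, and would need to be replaced with your own data.

```r
# Hypothetical regions, reusing the A/B/C labels from the example above
regions <- data.frame(region = c("A", "B", "C", "A", "B", "C"),
                      stringsAsFactors = FALSE)

# Assumed list of regions classified as high infiltration (illustrative only)
high_mafia <- c("A", "C")

# %in% gives TRUE/FALSE per row; as.integer() converts that to a 1/0 dummy
regions$mafia <- as.integer(regions$region %in% high_mafia)
```

The resulting column can then be used in a formula exactly like the mafia column in the random data above.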

If you only want to run the regression on the observations coded 1 in the "mafia" column, this is much easier: subset the data first and then fit the model.

# Note that A is a data.table
A.mafia = A[ mafia == 1 ]
formula = y ~ x
reg <- lm(formula, data = A.mafia)
summary(reg)

Output:

    Call:
    lm(formula = formula, data = A.mafia)

    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.47163 -0.26063 -0.05724  0.30166  0.52062 

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.43334    0.07926   5.467 1.53e-06 ***
    x            0.09017    0.14474   0.623    0.536    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.3197 on 49 degrees of freedom
    Multiple R-squared:  0.007857,  Adjusted R-squared:  -0.01239 
    F-statistic: 0.388 on 1 and 49 DF,  p-value: 0.5362
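
The same restriction can also be expressed without a separate subset object, through the `subset` argument of `lm`. This is a sketch in base R on freshly simulated data (a plain data.frame rather than the data.table above, so the numbers will not match the output shown):

```r
set.seed(1)
# Simulated data with the same column names as in the answer
A <- data.frame(y = runif(100), x = runif(100),
                mafia = sample(c(0, 1), 100, replace = TRUE))

# Fit only on rows with mafia == 1, using lm's subset argument
reg_subset <- lm(y ~ x, data = A, subset = mafia == 1)

# Equivalent fit on a manually subsetted data frame
reg_manual <- lm(y ~ x, data = A[A$mafia == 1, ])

# Both approaches give the same coefficients
all.equal(coef(reg_subset), coef(reg_manual))
```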

Thank you, I am not sure. I would like to have a dummy that just considers the regions coded 1 and not the regions that are coded 0 in relation to mafia. — 156158, Mar 02 '17 at 17:28
Do you mean that you just want to select the observations that have 1 in the mafia column and then run a regression on only those? — dleal, Mar 02 '17 at 17:29
So, I would like to do an interaction with the dummy and other variables. — 156158, Mar 02 '17 at 17:41
Then it would have to be this regression: y ~ x + factor(mafia) + I(x*mafia) — dleal, Mar 02 '17 at 17:43
Exactly, but I would like to create a dummy-mafia variable that only includes the regions that are coded 1 — 156158, Mar 02 '17 at 17:44
Since the mafia dummy is coded 1/0, you should NOT keep only the observations coded 1: the dummy would then be constant, and you cannot run a dummy variable regression on it — dleal, Mar 02 '17 at 17:47
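
To make the last two comments concrete, here is a small sketch on simulated data (same column names as in the answer, but a plain data.frame, so the numbers are illustrative only). It fits the interaction model on the full sample and then shows what happens when only the rows coded 1 are kept.

```r
set.seed(1)
A <- data.frame(y = runif(100), x = runif(100),
                mafia = sample(c(0, 1), 100, replace = TRUE))

# x * factor(mafia) expands to x + factor(mafia) + x:factor(mafia),
# i.e. the dummy, the covariate, and their interaction
reg_int <- lm(y ~ x * factor(mafia), data = A)
names(coef(reg_int))

# Keeping only mafia == 1 makes the dummy constant, so it is collinear
# with the intercept and lm() reports its coefficient as NA
only_one <- A[A$mafia == 1, ]
reg_const <- lm(y ~ x + mafia, data = only_one)
coef(reg_const)["mafia"]
```

This is why the dummy and its interaction have to be estimated on the full sample, with both the 0 and the 1 regions present.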
