Linear regression in R with if statement

Question

I have a dummy variable black where black==0 is White and black==1 is Black. I am trying to fit a linear model lm for the black==1 category only, however running the code below gives me the incorrect coefficients. Is there a way in R to run a model with the if statement, similar to Stata?

library(foreign)
df<-read.dta("hw4.dta")
attach(df)
black[black==0]<-NA
model3<-lm(rent~I(income^2)+income+black)

josliber · Answer 1 · 2014-03-01T00:16:19.677

3

It looks like there are a few issues here. First, you've stored all your data in separate vectors rent, income and black. You should instead store it in a data frame:

data <- data.frame(rent, income, black)

To limit a data frame based on a logical expression, you can use the subset function:

data.limited <- subset(data, black == 1)

Finally, you can run your analysis on your limited data frame (presumably without the black variable):

model3 <- lm(rent~I(income^2)+income, data=data.limited)
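For anyone who wants to try the three steps above without hw4.dta, here is a minimal sketch on simulated data. The numbers are made up purely for illustration (the real data set is not shown in the question), and black is assumed to be a numeric 0/1 dummy.

```r
# Simulated stand-in for hw4.dta; all values are invented for illustration.
set.seed(1)
n <- 200
income <- runif(n, 10, 100)
black <- rbinom(n, 1, 0.4)   # assumed coding: 0 = White, 1 = Black
rent <- 300 + 8 * income - 0.03 * income^2 + rnorm(n, sd = 20)

# Step 1: collect the vectors in a data frame.
data <- data.frame(rent, income, black)

# Step 2: keep only the rows where black == 1.
data.limited <- subset(data, black == 1)

# Step 3: fit the model on the limited data frame, without black.
model3 <- lm(rent ~ I(income^2) + income, data = data.limited)

# Sanity check: the fit should use exactly the black == 1 rows.
nobs(model3) == sum(data$black == 1)
coef(model3)
```

Comparing nobs(model3) with the number of black == 1 rows is a quick way to confirm that the restriction actually took effect.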

edited Mar 01 '14 at 00:16

answered Feb 28 '14 at 18:47

josliber


also, subset can be used within the lm call --- lm(...,subset=black==1) – Steve Reno Feb 28 '14 at 18:49
I'm slightly confused. I just added some more to my above code. Does this still apply if I have my data attached? – monarque13 Feb 28 '14 at 18:52
2

I think most would agree that using attach() is generally a bad idea. Better to leave your data in the data frame df and use df$variable calls for specific variables. model3<-lm(df$rent~I(df$income^2)+df$income,subset=df$black==1) should provide the results you're looking for. – Steve Reno Feb 28 '14 at 18:55
Others who are wiser than I have suggested to avoid using subset() in code (more for on-the-fly in the console), so I've tried to get in the habit of just using '['. Thus: `lm(rent~I(income^2)+income, data=data[data[,"black"]==0,])` – rbatt Feb 28 '14 at 19:25
I think my coding scheme is messed up because I cleaned my categories in Stata before importing the data set into R. None of the suggestions seem to work because `levels(black)` reveals `[1] "White" "Black"`. Not sure how to remedy this. – monarque13 Feb 28 '14 at 19:35
@rbatt `subset` is fine for interactive use. It is better to avoid it inside functions and loops and just stick with `[` – rawr Feb 28 '14 at 19:43
@user3339295 Well, if you have a factor simply use `data.limited <- df[df$black=="White",]`. – Roland Feb 28 '14 at 20:59
I hadn't heard anything negative about using `subset` in code, so I'm interested in hearing more about this. I get that it creates a new copy of part of my data, which in some cases is inefficient. However, would there be any benefit here of using `data.limited <- df[df$black == 0,]` instead of `data.limited <- subset(data, black == 0)`? Could you clarify the cases in which it's best to avoid `subset`? – josliber Mar 01 '14 at 00:15
1

@josilber http://stackoverflow.com/q/9860090/1412059 – Roland Mar 01 '14 at 13:47
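The factor issue raised in the comments above can be reproduced with a short sketch. It assumes black arrived from Stata as a factor with levels "White" and "Black", as the `levels(black)` output suggests; the data frame d and its values are hypothetical. With that coding, comparing against 0 or 1 matches nothing, so filtering has to compare against the label.

```r
# Hypothetical factor coded like the asker's: levels "White", "Black".
black <- factor(c("White", "Black", "Black", "White", "Black"),
                levels = c("White", "Black"))

# A factor is compared by its labels, so no element equals 1.
sum(black == 1)         # 0
sum(black == "Black")   # 3

# Filtering a data frame with `[` and the label (d is a made-up example).
d <- data.frame(rent = c(500, 620, 580, 450, 700),
                income = c(30, 45, 40, 25, 60),
                black = black)
d.limited <- d[d$black == "Black", ]

# If a numeric 0/1 dummy is wanted, it can be rebuilt from the labels.
d$black01 <- as.integer(d$black == "Black")
```

This would also explain why `black[black==0]<-NA` in the question had no visible effect: with a factor, no element equals 0, so nothing is set to NA and the model is fit on all rows.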

eclark · Answer 2 · 2014-03-01T00:23:43.557

3

Why not subset the data before running the model? I personally prefer using a dataframe rather than separate vectors which will make the subsetting easier.

df <- data.frame(rent, income, black)

Then subset the dataframe, or create another one

df <- df[df$black==1,]

And run the model

model3 <- lm(rent ~ I(income^2) + income, data=df)

edited Mar 01 '14 at 00:23

answered Feb 28 '14 at 18:51

eclark


With the added lines of code you can do model3<-lm(rent~I(income^2)+income+black, na.action=na.omit) – eclark Feb 28 '14 at 18:55
1

probably best not to use the `black` variable in the new model, since it will be constant in the limited data frame. – josliber Mar 01 '14 at 00:17
That's right, oversight on my part. Thanks! I'll edit it. – eclark Mar 01 '14 at 00:23

score 2 · Answer 3 · edited Jan 03 '15 at 00:08

2

The code written below should do it.

model3 <- lm(rent~I(income^2)+income+black, data=df, subset=df$black==1)
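As a rough check that the subset argument behaves like Stata's if qualifier, the sketch below fits the model both ways on simulated data and compares the coefficients. The data are invented for illustration, black is assumed to be a numeric 0/1 dummy, and black is left out of the formula here because it is constant within the restricted sample.

```r
# Simulated data; values are made up and black is assumed numeric 0/1.
set.seed(42)
df <- data.frame(income = runif(150, 10, 100),
                 black = rbinom(150, 1, 0.5))
df$rent <- 250 + 6 * df$income + rnorm(150, sd = 25)

# Restrict inside the call: subset is evaluated within `data`.
fit_arg <- lm(rent ~ I(income^2) + income, data = df, subset = black == 1)

# Restrict beforehand with `[` and fit on the smaller data frame.
fit_pre <- lm(rent ~ I(income^2) + income, data = df[df$black == 1, ])

# Both routes should give the same coefficients.
all.equal(coef(fit_arg), coef(fit_pre))
```

Because subset is evaluated inside data, the column can be referenced by its bare name when data is supplied.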

edited Jan 03 '15 at 00:08

Nikos


answered Jan 02 '15 at 23:37

Isidro Jr


you might not even need the `df$` – Ben Bolker Jan 03 '15 at 00:33
Why are you passing data = df into the lm function – user3042850 Mar 14 '16 at 16:31
