I am trying to define a subset of a dataframe for a standard lm model using a "for loop". In the subset expression, I want to refer to col1 using paste and subset all observations where col1-3 is not NA. I have tried the following things, but they do not work:

    for(i in 1:3) {
      lm(y ~ x1 + x2, data=subset(df, x3="Y" & !is.na(paste0("col", i))))
    }

OR define the colname separately:

    for(i in 1:3) {
      colname <- as.name(paste0("col", i))
      lm(y ~ x1 + x2, data=subset(df, x3="Y" & !is.na(colname)))
    }

BTW: This is a simplified code to illustrate what I am trying to do. The code in my script does not give an error but ignores the !is.na condition of the subset expression. However, it works if done manually like this:

    lm(y ~ x1 + x2, data=subset(df, x3="Y" & !is.na(col1)))

I would greatly appreciate some advice!

Thanks in advance!

FK

F-G-K
  • Try substituting `x3 == "Y"`; `x3 = "Y"` is assignment, not a logical operation – Alexey Jul 02 '18 at 18:03
  • Thanks! That's what I had, I just got it wrong for the question - the original question remains though... – F-G-K Jul 02 '18 at 18:27
  • Please share sample of your data using `dput()` (not `str` or `head` or picture/screenshot) so others can help. See more here https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?rq=1 – Tung Jul 02 '18 at 18:47
  • Hi Tung, sure will do next time. BR, – F-G-K Jul 03 '18 at 08:27

1 Answer

The is.na() portion is being "ignored" because what you think is being evaluated isn't what is being evaluated. What is being evaluated is:

    !is.na("col1")

and the string "col1" is obviously not NA, so it evaluates to TRUE and is recycled for all rows in your data. The issue is that you have a variable name stored as a string, while subset() needs a logical vector. So you need a way to turn that string into the corresponding logical vector that subset() needs. You can update your code to use something along the lines of:

    for(i in 1:3) {
      lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(df[[paste0("col", i)]])))
    }

This isn't optimal, though, and there are other ways you can (and probably should) update your code. Something along the lines of:

    for(i in 1:3) {
      lm(y ~ x1 + x2, data = df,
         subset = df$x3 == "Y" & !is.na(df[[paste0("col", i)]]))
    }

is a bit cleaner as it uses the subset argument to subset your data.
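As a quick check of the explanation above, here is a minimal sketch on a made-up data frame (the values are hypothetical; only the column names x3 and col1 come from the question). It compares the string-based check with the column lookup:

```r
# Hypothetical toy data; only the column names mirror the question.
df <- data.frame(
  x3   = c("Y", "Y", "N", "Y"),
  col1 = c(1, NA, 3, NA)
)

# The string "col1" itself is never NA, so this is a single TRUE.
string_check <- !is.na(paste0("col", 1))

# Looking the column up by name gives one logical value per row.
column_check <- !is.na(df[[paste0("col", 1)]])

# With the string check, rows where col1 is NA slip through the filter.
rows_string <- nrow(subset(df, x3 == "Y" & string_check))
rows_column <- nrow(subset(df, x3 == "Y" & column_check))
```

Here rows_string counts every row with x3 equal to "Y", while rows_column counts only those that also have a non-missing col1.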

You still have the issue that you're not storing the results of your call to lm() anywhere.
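One way to keep the fitted models is to collect them in a named list. The following is only a sketch: the data are simulated, and the names models and keep are made up for illustration. It filters the rows before calling lm(), which is a variation on the loops above rather than the only way to do it.

```r
set.seed(1)  # simulated data, for illustration only
n <- 40
df <- data.frame(
  y    = rnorm(n),
  x1   = rnorm(n),
  x2   = rnorm(n),
  x3   = rep(c("Y", "N"), length.out = n),
  col1 = rnorm(n),
  col2 = rnorm(n),
  col3 = rnorm(n)
)
df$col1[1:4]  <- NA  # introduce some missing values
df$col2[5:10] <- NA

# Pre-allocate a list with one slot per model, named after the column.
models <- vector("list", 3)
names(models) <- paste0("col", 1:3)

for (i in 1:3) {
  # Rows where x3 is "Y" and the i-th column is not missing.
  keep <- df$x3 == "Y" & !is.na(df[[paste0("col", i)]])
  models[[i]] <- lm(y ~ x1 + x2, data = df[keep, ])
}

# Number of observations used in each fit.
n_used <- sapply(models, nobs)
```

Each model can then be inspected later, for example with summary(models[["col2"]]) or coef(models$col1).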

MHammer