Use paste and !is.na to subset data frame

Question

I am trying to define a subset of a dataframe for a standard lm model using a "for loop". In the subset expression, I want to refer to col1 using paste and subset all observations where col1-3 is not NA. I have tried the following things, but they do not work:

    for(i in 1:3) {
      lm(y ~ x1 + x2, data=subset(df, x3="Y" & !is.na(paste0("col", i))))
    }

OR define the colname separately:

    for(i in 1:3) {
      colname <- as.name(paste0("col", i))
      lm(y ~ x1 + x2, data=subset(df, x3="Y" & !is.na(colname)))
    }

BTW: This is simplified code to illustrate what I am trying to do. The code in my script does not give an error, but it ignores the !is.na condition of the subset expression. However, it works if done manually like this:

    lm(y ~ x1 + x2, data=subset(df, x3="Y" & !is.na(col1)))

I would greatly appreciate some advice!

Thanks in advance!

FK

try to substitute to x3 **==** "Y", `x3 = "Y" ` is assignment, not a logical operation — Alexey, Jul 02 '18 at 18:03
Thanks! That's what I had, I just got it wrong for the question - the original question remains though... — F-G-K, Jul 02 '18 at 18:27
Please share sample of your data using `dput()` (not `str` or `head` or picture/screenshot) so others can help. See more here https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?rq=1 — Tung, Jul 02 '18 at 18:47

MHammer · Accepted Answer · 2018-07-02T19:33:55.577

The is.na() portion is being "ignored" because what you think is being evaluated isn't what is actually being evaluated. What is being evaluated is:

    !is.na("col1")

and the string "col1" is obviously not NA, so it evaluates to a single TRUE that is recycled across all rows of your data. The issue is that you have a variable name stored as a string, whereas subset() needs a logical vector with one value per row. So you need a way to turn the name stored in the string into the corresponding column, and from that into the logical vector that subset() needs. You can update your code to use something along the lines of:

    for(i in 1:3) {
      lm(y ~ x1 + x2, data=subset(df, x3=="Y" & !is.na(df[[paste0("col", i)]])))
    }

Calling subset() with df[[...]] like this works, but it isn't optimal, and there are other ways you can and probably should update your code. Something along the lines of:

    for(i in 1:3) {
      lm(y ~ x1 + x2, data = df,
        subset = df$x3 == "Y" & !is.na(df[[paste0("col", i)]]))
    }

is a bit cleaner as it uses the subset argument to subset your data.
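To see why the original condition had no effect, it may help to compare the two tests side by side. The sketch below uses a small made-up data frame (the values, and the names string_test and column_test, are invented for illustration and are not from the question):

```r
# Toy data, invented for illustration (not the asker's data)
df <- data.frame(
  y    = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.4),
  x1   = c(1, 2, 3, 4, 5, 6),
  x3   = c("Y", "Y", "N", "Y", "Y", "Y"),
  col1 = c(1, NA, 3, 4, NA, 6)
)

i <- 1
colname <- paste0("col", i)

# is.na() on the string itself: a single TRUE, recycled over every row
string_test <- !is.na(colname)

# is.na() on the column the string names: one logical value per row
column_test <- !is.na(df[[colname]])

nrow(subset(df, x3 == "Y" & string_test))  # 5: rows with NA in col1 are kept
nrow(subset(df, x3 == "Y" & column_test))  # 3: rows with NA in col1 are dropped
```

Only the second version actually looks inside col1, which is why the first one behaves as if the !is.na condition were not there.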

You also still have the issue that you're not storing the results of your calls to lm() anywhere.
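One possible way to keep the fitted models is to collect them in a named list. This is only a sketch: the data frame below is simulated, and the names models and keep are hypothetical helpers rather than anything from the question.

```r
# Simulated data, invented for illustration
set.seed(1)
n <- 30
df <- data.frame(
  y    = rnorm(n),
  x1   = rnorm(n),
  x2   = rnorm(n),
  x3   = rep(c("Y", "Y", "N"), length.out = n),
  col1 = replace(rnorm(n), c(1, 2), NA),
  col2 = replace(rnorm(n), c(4, 5, 7), NA),
  col3 = rnorm(n)
)

# Pre-allocate a list with one slot per model, named after the column used
models <- vector("list", 3)
names(models) <- paste0("col", 1:3)

for (i in 1:3) {
  # Logical vector: x3 is "Y" and the i-th column is not missing
  keep <- df$x3 == "Y" & !is.na(df[[paste0("col", i)]])
  models[[i]] <- lm(y ~ x1 + x2, data = df, subset = keep)
}

# Number of observations each model was actually fitted on
sapply(models, nobs)
```

Each fit can then be inspected afterwards, for example with summary(models$col1) or models[["col2"]].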

Thanks, MHammer! That was very helpful - sorry if this was trivial. — F-G-K, Jul 03 '18 at 08:26
