R different behavior when accessing columns from within function as opposed to interactively

Question

I have a data frame named granular that contains, in relevant part:

factor column GranularClass, one of whose values is "Constitutional Law I Spring 2016", and
several numeric columns, for example Knowledge. The numeric columns contain NAs.

I'm trying to write a function that counts the non-NA values for a given column, conditional on a given factor value. However, my attempt to count the values behaves differently depending on whether I write it as a function or just use it in the console.

More specifically, the following code fails:

# take subset of the dataframe containing only the factor values I want to look at:
isolate <- function(class) {
  return(granular[granular$GranularClass == class, ])
}

# count non-NA values:
cr <- function(df, column){
  return(sum(!is.na(df$column)))
}

# this fails
cr(isolate("Constitutional Law I Spring 2016"), Knowledge)

That last call gives incorrect output (it just returns 0), and throws a warning:

Warning message:
In is.na(df$column) :
  is.na() applied to non-(list or vector) of type 'NULL'

However, this succeeds:

sum(!is.na(isolate("Constitutional Law I Spring 2016")$Knowledge))

# gives correct output: [1] 62

And, so... huh? I believe that the working code in the last block is semantically identical to the function call in the first block that blows up. But obviously that's not right.

Am I somehow passing the column name into the function wrong? (Should I be passing it as a string? But this prior SO question suggests you can't pass strings into the $ operator.)

score 1 · Accepted Answer · edited May 23 '17 at 12:18

Now that I've written two paragraphs as comments, I'll make them an answer:

$ doesn't evaluate/parse the column name that follows. If you want to use a variable column name, the easiest way is a string column name with [ or [[, not $. Try, for example, setting x = 'mpg' and then running mtcars$mpg; mtcars$x; mtcars[, mpg]; mtcars[, 'mpg']; mtcars[, x]. Note especially that mtcars$x does not return a column, even though x is defined as 'mpg' and there is a column named 'mpg'. This is the root of your problem and the main point of the question that you link to, Select a data frame column using $ and the name of the column as a string in a variable. It doesn't matter that you are using $ inside a function.
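A quick way to check the difference is with the built-in mtcars data. This is only a sketch of the behavior described above; the variable name x is the one from the example.

```r
x <- "mpg"

# `$` takes the name after it literally, so this looks for a column called "x".
is.null(mtcars$x)                    # TRUE: mtcars has no column named "x"

# `[[` and `[` evaluate their argument, so the string stored in x is used.
identical(mtcars[[x]], mtcars$mpg)   # TRUE
identical(mtcars[, x], mtcars$mpg)   # TRUE
```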

See also fortunes::fortune(312) and fortunes::fortune(343).

But mtcars$x doesn't throw an error - it returns NULL because there is no column named 'x'. So the differences in behavior you are observing are because you are doing different things with the results and it is the downstream calls that throw errors. is.na(NULL) gives a warning and a 0-length result - which is summable to 0. But no error here.
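To see why the function returns 0 rather than stopping, the downstream calls can be run on NULL directly. A small sketch; whether is.na(NULL) emits the warning may depend on the R version, so it is suppressed here.

```r
# is.na() on NULL yields a zero-length logical vector (possibly with a warning).
res <- suppressWarnings(is.na(NULL))
length(res)   # 0

# Negating and summing an empty logical vector gives 0, not an error.
sum(!res)     # 0
```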

Your isolate function is weird because it relies on having a data frame in the global environment named granular with a column named GranularClass. It would be better practice to pass in the data frame, but whatever. This doesn't much matter unless this function is in a package being submitted to CRAN.
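One way to pass the data frame in might look like the sketch below. The name isolate_from, the class_col argument, and the toy data are made up for illustration; the real granular data frame is not shown in the question.

```r
# Hypothetical stand-in for the asker's data.
toy <- data.frame(
  GranularClass = factor(c("A", "A", "B")),
  Knowledge     = c(1, NA, 3)
)

# Same filter as isolate(), but the data frame and the class column
# are explicit arguments instead of being read from the global environment.
isolate_from <- function(df, class, class_col = "GranularClass") {
  df[df[[class_col]] == class, ]
}

nrow(isolate_from(toy, "A"))   # 2
```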

When you write df$column, column is not evaluated, even though it is an argument to your cr function: $ is special and treats whatever follows it as a literal name, so R looks for a column that is literally called column.

For your function to work, you should rewrite it to be

cr = function(df, column) sum(!is.na(df[, column]))

and call it as

cr(isolate("Constitutional Law I Spring 2016"), "Knowledge")

Using strings as column names is the only straightforward way to pass column names as arguments.

If you really want to pass unquoted column names, then use the lazyeval package. Its vignette is very good. But you will have to write the standard-evaluating version as above, and then write a non-standard-evaluating wrapper around it. It's generally not worth the hassle.
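The two-layer pattern can be sketched without lazyeval, using base R's substitute() and deparse() instead. This is a swapped-in technique rather than the lazyeval approach itself, and the names cr_se, cr_nse, and toy are hypothetical.

```r
# Standard-evaluating version: `column` is a string.
cr_se <- function(df, column) sum(!is.na(df[[column]]))

# Non-standard-evaluating wrapper: captures the unquoted name the caller
# typed, converts it to a string, and hands it to the version above.
cr_nse <- function(df, column) {
  cr_se(df, deparse(substitute(column)))
}

toy <- data.frame(Knowledge = c(1, NA, 3, NA))
cr_se(toy, "Knowledge")   # 2
cr_nse(toy, Knowledge)    # 2
```

A wrapper like this only handles a bare column name typed directly in the call; it breaks if the name is itself passed through another function's argument, which is part of why the string version is usually the simpler choice.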

awesome, thanks so much! I knew something had to be wonky with that $ operator, but it's such a dark corner of the language I had no clue how to fix. (And yeah, grabbing global variables is a big naughty, but this is just a personal hack script, so hey. :-) ) — Paul Gowder, Aug 24 '16 at 20:30
