
One of the primary things I get stuck on when programming in R is passing variable names into functions. I come from a Stata background, where we can easily call globals with "$" in any code or function. However, that doesn't seem to work in R. Sometimes I have to use a special package, or something like df[[x]]. Instead of handling this ad hoc each time, I was wondering if someone could walk me through how R works here, so that I understand how to address this problem every time I run into it.

As a simple example, I am currently working on code that stores a row count:

rowcount <- function(x){
all_n <- length(which(!is.na(df$x) & df$model=="Honda"))
print(all_n)
}

The function simply stores the count of rows where x is not missing and model is "Honda". I want to be able to pass the variable name into the function, then have it return this count. For instance, for the variable gender, I want to be able to write `rowcount(gender)`, and for gender to be passed into the function as `df$gender`. However, this doesn't happen.

Can someone explain how to fix this code and, in the process, how I can generally fix these types of problems? I know there may be more elegant ways to achieve my goal, but my intention is both to (1) get code that fulfills a specific goal for my project, and (2) more generally understand how R treats variable names as arguments in functions.

Thanks

MrFlick
Hutchins
  • You might also be interested in this chapter of Advanced R: http://adv-r.had.co.nz/Computing-on-the-language.html – MrFlick Sep 04 '19 at 14:29
  • It's a matter of programming scope. It is generally encouraged to keep functions self-contained. That means not using or editing global variables, which prevents mistakes across different scopes. This becomes more noticeable when you have nested loops with functions calling global variables, or when you have recursive functions (functions that call themselves). If you want to use a variable name in a function, pass the data and the column name, and then use MrFlick's link to see how to do it. – Adam Sampson Sep 04 '19 at 14:33

1 Answer


We can pass the column name as a string and then use [[ to extract the column. It is better to also have the data as an argument of the function, so that it can be reused for different datasets.

rowcount <- function(data, x){
    # data[[x]] extracts the column named by the string in x; model is taken from data as well
    all_n <- length(which(!is.na(data[[x]]) & data$model == "Honda"))
    all_n
}

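Note that print is meant for displaying the output. To use the count elsewhere, the function should return the object created. In R, we don't have to call return() explicitly: the value of the last evaluated expression is returned.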
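To make the string-based version concrete, here is a minimal sketch on a made-up data frame. The name `cars` and its values are assumptions for illustration only and are not from the question's data.

```r
# Hypothetical toy data; column values are invented for illustration
cars <- data.frame(
  model  = c("Honda", "Honda", "Toyota", "Honda"),
  gender = c("F", NA, "M", "M"),
  stringsAsFactors = FALSE
)

rowcount <- function(data, x) {
  # data[[x]] looks up the column whose name is the string stored in x
  length(which(!is.na(data[[x]]) & data$model == "Honda"))
}

# The column name is passed as a string; the result can be stored and reused
n_gender <- rowcount(cars, "gender")
n_gender
```

Because the function returns the count rather than only printing it, the result can be assigned to a variable and used in later steps.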


In addition to the OP's method, it can also be done with sum, because summing a logical vector counts its TRUE values

rowcount <- function(data, x){
    sum(!is.na(data[[x]]) & data$model == "Honda")
}

Note that we don't have to create an object and then return it if the body is a single expression


As an aside, the tidyverse option would be

library(dplyr)
rowcount <- function(data, x) {
     x <- enquo(x)
     data %>%
        summarise(out = sum(!is.na(!!x) & model == "Honda")) %>%
        pull(out)
 }

where we can pass the column name unquoted

rowcount(df1, columnname)
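For comparison, base R can also accept an unquoted column name without dplyr, by capturing the argument with substitute() and converting it to a string with deparse(). This is only a sketch: `rowcount_nse` and `cars` are hypothetical names, the toy data is made up, and `na.rm = TRUE` is an added assumption so that missing values in model do not turn the sum into NA.

```r
# Hypothetical toy data; column values are invented for illustration
cars <- data.frame(
  model  = c("Honda", "Honda", "Toyota", "Honda"),
  gender = c("F", NA, "M", "M"),
  stringsAsFactors = FALSE
)

rowcount_nse <- function(data, x) {
  # substitute() captures what the caller typed without evaluating it;
  # deparse() turns that expression into a character string such as "gender"
  col <- deparse(substitute(x))
  sum(!is.na(data[[col]]) & data$model == "Honda", na.rm = TRUE)
}

# The column name is passed unquoted
n_unquoted <- rowcount_nse(cars, gender)
n_unquoted
```

The trade-off is that a function written this way is harder to call from other functions, since it captures whatever expression the caller typed. That is one reason the string version with [[ is often the simpler choice.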
akrun
  • Your first code doesn't work. I get the error that whatever I specify as x cannot be found – Hutchins Sep 04 '19 at 14:48
  • @Hutchins You need to specify the column name as a string. If `mtcars` is the dataset, `rowcount(mtcars, "mpg")`. Of course, 'model' is not a column in that data, so you have to make those adjustments before running it – akrun Sep 04 '19 at 14:49
  • Thanks. That worked. However, while I understand how to address this particular problem, I still don't understand the general theory. Why can't I just use `df$x` and pass it in that? Why must I use quotes? – Hutchins Sep 04 '19 at 14:52
  • @Hutchins Because `df$x` looks for a column literally named 'x' instead of using the value passed into 'x' – akrun Sep 04 '19 at 14:53
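The difference described in the last comment can be seen in a small sketch. Unlike Stata's "$" macro expansion, R does not substitute text into code: `$` treats what follows it as a literal name, while `[[` evaluates its argument first. The data frame below is made up for illustration, and it deliberately contains a column named x to show which one `$` picks.

```r
# Hypothetical data with a column literally named "x"
df <- data.frame(mpg = c(21, 22.8), x = c("a", "b"), stringsAsFactors = FALSE)

# A variable named x that holds a column name
x <- "mpg"

literal_x   <- df$x       # `$` ignores the variable x and takes the column named "x"
by_value    <- df[[x]]    # `[[` evaluates x first, so this is the "mpg" column
missing_col <- df$gender  # no column named "gender": a plain data.frame gives NULL

literal_x
by_value
missing_col
```

So inside a function, `df$x` never sees the value of the argument x. It asks for a column called "x", and if none exists the result is NULL rather than an error, which is why the original function appears to run but gives the wrong count.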