I have a data frame cat_data which has a column workclass:

> cat_data$workclass
   [1] "State-gov"        "Self-emp-not-inc" "Private"          "Private"          "Private"    ... [ reached getOption("max.print") -- omitted 31561 entries ]

And column y is

> cat_data$y
   [1] "<=50K" "<=50K" "<=50K" "<=50K" "<=50K" "<=50K" "<=50K" ">50K"  ">50K"  ">50K"  ">50K"  ">50K"  "<=50K"   ...[ reached getOption("max.print") -- omitted 31561 entries ]

I wrote a script to prepare for Naive Bayes analysis:

library(dplyr)
library(tidyr)  # spread() comes from tidyr, not dplyr
workclass <- cat_data %>%
  group_by(workclass, y) %>%
  summarise(num = n()) %>%
  spread(y, num) %>%
  ungroup()

It gave me what I want:

> workclass
# A tibble: 9 x 3
  workclass        `<=50K` `>50K`
  <chr>              <int>  <int>
1 ?                   1645    191
2 Federal-gov          589    371
3 Local-gov           1476    617
4 Never-worked           7     NA
5 Private            17733   4963
6 Self-emp-inc         494    622
7 Self-emp-not-inc    1817    724
8 State-gov            945    353
9 Without-pay           14     NA

Since I need to do the same data preparation many times and don't want to rewrite this chunk again and again, I decided to write a function:

get_frequency <- function(column){
  cat_data %>%
    group_by(column, y) %>%
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}

When I tried workclass <- get_frequency(workclass), it threw an error:

Error: Column `column` is unknown

How can I fix it?

Update: I've been able to fix it by capturing the argument with enquo() and unquoting it with !! inside group_by:

library(rlang)
get_frequency <- function(column){
  column <- enquo(column)
  column <- cat_data %>%
    group_by(!!column, y) %>%
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
  return(column)
}
> workclass <- get_frequency(workclass)
> workclass
# A tibble: 9 x 3
  workclass        `<=50K` `>50K`
  <chr>              <int>  <int>
1 ?                   1645    191
2 Federal-gov          589    371
3 Local-gov           1476    617
4 Never-worked           7     NA
5 Private            17733   4963
6 Self-emp-inc         494    622
7 Self-emp-not-inc    1817    724
8 State-gov            945    353
9 Without-pay           14     NA

Thanks everybody!

vincent
  • Please add reproducible sample data; this is a matter of properly quoting/unquoting `column` inside your function to adhere to `dplyr`'s non-standard evaluation. – Maurits Evers Feb 26 '19 at 02:45
  • Add a `dput` of data. What is `y` in your function? – NelsonGon Feb 26 '19 at 02:45
  • To write a function that takes a column name as an argument like this, you'll need to do some [tidyeval](https://dplyr.tidyverse.org/articles/programming.html) – camille Feb 26 '19 at 03:14

1 Answer

Maurits Evers is correct, and there are a few ways of doing this. My preferred method (and, from what I have read, the most idiomatic one) is the !! operator from the rlang package. It unquotes the captured argument, so group_by sees the column the argument refers to rather than the literal name column. rlang also provides !!!, which unquotes and splices a list of arguments; that comes in handy if you want to pass multiple columns to group_by.
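To illustrate the !!! case, here is a minimal sketch under a few assumptions: toy is a made-up stand-in for cat_data, and count_by is a hypothetical helper name, not something from the question. It takes the grouping columns as a character vector, turns them into symbols with syms(), and splices them into group_by.

```r
library(dplyr)
library(rlang)

# Toy stand-in for cat_data (made-up values, not the real data set)
toy <- data.frame(
  workclass = c("Private", "Private", "State-gov", "State-gov"),
  sex       = c("Male", "Female", "Male", "Male"),
  y         = c("<=50K", ">50K", "<=50K", "<=50K"),
  stringsAsFactors = FALSE
)

# count_by() is a hypothetical helper; `columns` is a character vector of names
count_by <- function(data, columns) {
  data %>%
    group_by(!!!syms(columns)) %>%  # syms() builds a list of symbols, !!! splices it
    summarise(num = n()) %>%
    ungroup()
}

counts <- count_by(toy, c("workclass", "y"))
```

Passing the data frame as an argument, instead of reading cat_data from the global environment, is a design choice of this sketch; it keeps the helper reusable on other data.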

Some of the ways I have done it in the past:

  • !! operator from rlang: unquotes the argument (a symbol built with sym() or a quosure captured with enquo()) so that it is evaluated as a column of the data.
  • eval(parse(text = column)): as the call suggests, this parses the column name held in a string and evaluates the resulting expression.
  • group_by_: the SE (standard evaluation) version of the dplyr verb group_by, which allows for exactly what Maurits Evers referred to. Note that it is now deprecated in favour of group_by.

Bear in mind that it matters whether you pass a string (such as "workclass") or a bare column name (workclass) into your function: a string needs sym(), while a bare name needs enquo(). Play around with these options and you should get it working the way you prefer.
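The difference between the two calling styles can be sketched side by side. This is only an illustration: toy is made-up data standing in for cat_data, and the names freq_from_string and freq_from_name are hypothetical.

```r
library(dplyr)
library(tidyr)
library(rlang)

# Toy stand-in for cat_data (made-up values)
toy <- data.frame(
  workclass = c("Private", "Private", "State-gov"),
  y         = c("<=50K", ">50K", "<=50K"),
  stringsAsFactors = FALSE
)

# Variant 1: the caller passes the column name as a string
freq_from_string <- function(data, column) {
  data %>%
    group_by(!!sym(column), y) %>%  # sym() turns "workclass" into the symbol workclass
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}

# Variant 2: the caller passes the bare column name
freq_from_name <- function(data, column) {
  column <- enquo(column)           # capture the expression the caller typed
  data %>%
    group_by(!!column, y) %>%
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}

a <- freq_from_string(toy, "workclass")
b <- freq_from_name(toy, workclass)
```

Both calls should produce the same table; only the way the column is handed to the function differs.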

Example:

get_frequency <- function(column){
  cat_data %>%
    group_by(!! sym(column), y) %>%
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}
get_frequency("workclass")  # pass the column name as a string

Alternatively, if you would rather not pass a string:

get_frequency <- function(column){
  cat_data %>%
    group_by(!! enquo(column), y) %>%
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}

get_frequency(workclass)  # pass the bare column name
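A shorter spelling of the enquo() plus !! pattern is the embrace operator {{ }}. The sketch below assumes rlang 0.4.0 or later (which introduced it) and uses made-up toy data; get_frequency2 is a hypothetical name chosen so it does not clash with the function above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for cat_data (made-up values)
toy <- data.frame(
  workclass = c("Private", "Private", "State-gov"),
  y         = c("<=50K", ">50K", "<=50K"),
  stringsAsFactors = FALSE
)

# Hypothetical variant: {{ column }} captures and unquotes in one step
get_frequency2 <- function(data, column) {
  data %>%
    group_by({{ column }}, y) %>%
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}

freq <- get_frequency2(toy, workclass)
```

As in the original output, a combination that never occurs (here State-gov with >50K) shows up as NA after spread.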
Croote
  • Please edit to show how you would actually change up the function to make it work. The text is good but an executable example is great! Also note: `group_by_() is deprecated. Please use group_by() instead` – NelsonGon Feb 26 '19 at 03:32
  • Thanks! This is informative. – vincent Feb 26 '19 at 16:38