I have a data frame cat_data which has a column workclass:
> cat_data$workclass
[1] "State-gov" "Self-emp-not-inc" "Private" "Private" "Private" ... [ reached getOption("max.print") -- omitted 31561 entries ]
It also has a column y:
> cat_data$y
[1] "<=50K" "<=50K" "<=50K" "<=50K" "<=50K" "<=50K" "<=50K" ">50K" ">50K" ">50K" ">50K" ">50K" "<=50K" ...[ reached getOption("max.print") -- omitted 31561 entries ]
I wrote a script to prepare for Naive Bayes analysis:
library(dplyr)
library(tidyr)  # spread() comes from tidyr, not dplyr
workclass <- cat_data %>%
  group_by(workclass, y) %>%
  summarise(num = n()) %>%
  spread(y, num) %>%
  ungroup()
It gave me what I want:
> workclass
# A tibble: 9 x 3
workclass `<=50K` `>50K`
<chr> <int> <int>
1 ? 1645 191
2 Federal-gov 589 371
3 Local-gov 1476 617
4 Never-worked 7 NA
5 Private 17733 4963
6 Self-emp-inc 494 622
7 Self-emp-not-inc 1817 724
8 State-gov 945 353
9 Without-pay 14 NA
Since I need to do the same data preparation for many columns and I don't want to rewrite this chunk again and again, I decided to write a function:
get_frequency <- function(column){
  cat_data %>%
    group_by(column, y) %>%
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}
When I tried workclass <- get_frequency(workclass), it threw an error:
Error: Column `column` is unknown
How can I fix it?
Update: I've been able to fix it by capturing the argument with enquo() and unquoting it with !! inside group_by():
library(rlang)
get_frequency <- function(column){
  column <- enquo(column)        # capture the bare column name unevaluated
  cat_data %>%
    group_by(!!column, y) %>%    # unquote it so group_by() sees workclass, not `column`
    summarise(num = n()) %>%
    spread(y, num) %>%
    ungroup()
}
> workclass <- get_frequency(workclass)
> workclass
# A tibble: 9 x 3
workclass `<=50K` `>50K`
<chr> <int> <int>
1 ? 1645 191
2 Federal-gov 589 371
3 Local-gov 1476 617
4 Never-worked 7 NA
5 Private 17733 4963
6 Self-emp-inc 494 622
7 Self-emp-not-inc 1817 724
8 State-gov 945 353
9 Without-pay 14 NA
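For anyone landing here later, the same idea can probably be written a bit more compactly with the `{{ }}` (curly-curly) operator, which does the enquo() and !! steps in one go. This is only a sketch, assuming rlang 0.4.0 or later and tidyr 1.1.0 or later. The name get_frequency2, the extra data argument, and toy_data are hypothetical and not part of my original script. It also fills the missing combinations with 0 instead of NA, which may be more convenient when computing Naive Bayes frequencies.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for cat_data
toy_data <- tibble(
  workclass = c("Private", "Private", "State-gov", "Private", "Never-worked"),
  y         = c("<=50K",   ">50K",    "<=50K",     "<=50K",   "<=50K")
)

# Sketch: take the data frame as an argument instead of relying on a global,
# and forward the bare column name with {{ }}.
get_frequency2 <- function(data, column) {
  data %>%
    count({{ column }}, y, name = "num") %>%   # same as group_by + summarise(n())
    pivot_wider(
      names_from  = y,
      values_from = num,
      values_fill = 0                          # 0 instead of NA for absent pairs
    )
}

freq <- get_frequency2(toy_data, workclass)
print(freq)
```

Passing the data frame explicitly means the same function can be reused on other data sets, for example get_frequency2(cat_data, workclass).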
Thanks everybody!