I'm trying to figure out what I'm doing wrong here. Using the following training data I compute some frequencies using dplyr:

library(dplyr)

group.count <- c(101, 99, 4)
data <- data.frame(
    by = rep(3:1, group.count),
    y = rep(letters[1:3], group.count))

data %>%  
group_by(by) %>%
summarise(non.miss = sum(!is.na(y)))

Which gives me the outcome I'm looking for. However, when I try to do it as a function:

res0   <- function(x1,x2) {
output = data %>%  
    group_by(x2) %>%
    summarise(non.miss = sum(!is.na(x1)))
}

res0(y,by)

I get an error (index out of bounds). Can anybody tell me what I'm missing?
Thanks in advance.

Rich Scriven

2 Answers


You can't do this like that in dplyr.

The problem is that the function arguments are never matched to your columns. Inside res0, group_by(x2) looks for a column literally named x2, which is not part of your data.frame, and the by you pass in does not exist as a data object anywhere (outside the data frame, by only refers to the base R function). Your first thought might be to pass "by" instead, but that won't work with group_by either. To show this, build your data.frame like this:

data   <- data.frame(
  x2 = rep(3:1,group.count),
  x1 = rep(letters[1:3],group.count)
)

Then call your function again and it will return the expected output.
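
For readers on a newer setup, here is a minimal sketch that assumes dplyr 1.0 or later (which postdates this question). In those versions the embrace operator {{ }} forwards unquoted column names from a function's arguments to group_by and summarise. The extra df argument is an addition made for illustration and is not part of the original res0.

```r
library(dplyr)

# Same example data as in the question
group.count <- c(101, 99, 4)
data <- data.frame(
  by = rep(3:1, group.count),
  y  = rep(letters[1:3], group.count)
)

# Sketch only: assumes dplyr >= 1.0, where {{ }} is available.
# df is a hypothetical extra argument so the function does not
# depend on a global object called `data`.
res0 <- function(df, x1, x2) {
  df %>%
    group_by({{ x2 }}) %>%                         # group by the column passed as x2
    summarise(non.miss = sum(!is.na({{ x1 }})))    # count non-missing values of x1
}

result <- res0(data, y, by)
print(result)
```

The call keeps the unquoted style of the original attempt, res0(data, y, by), so nothing has to be renamed in the data frame.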

stanekam
    Note that there is a solution: change it to `%>% regroup(list(x2))`, along with changing the function call to `res0(y, "by")` – David Robinson Sep 17 '14 at 05:24

I suggest changing the name of your dataframe to df.

This is basically what you have done:

df %>%  
  group_by(by) %>%
  summarise(non.miss = sum(!is.na(y)))

which produces this:

#  by non.miss
#1  1        4
#2  2       99
#3  3      101

but to count the number of observations per group, you could use length, which gives the same answer:

df %>%  
  group_by(by) %>%
  summarise(non.miss = length(y))


#  by non.miss
#1  1        4
#2  2       99
#3  3      101

or, use tally, which gives this:

df %>%  
  group_by(by) %>%
  tally

#  by   n
#1  1   4
#2  2  99
#3  3 101
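
As a side note, dplyr also provides count(), which combines the group_by and tally steps in one call. The sketch below rebuilds the example data under the name df, as suggested above; the variable name counts is only illustrative.

```r
library(dplyr)

# Example data from the question, stored as df
group.count <- c(101, 99, 4)
df <- data.frame(
  by = rep(3:1, group.count),
  y  = rep(letters[1:3], group.count)
)

# count(by) groups by `by` and tallies the rows in each group;
# the counts land in a column named n
counts <- df %>% count(by)
print(counts)
```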

If you really wanted to, you could put that into a function that takes the dataframe as its input. Like this:

res0   <- function(df) {
df %>%  
    group_by(by) %>%
    tally 
}

res0(df)

#       by   n
#1       1   4
#2       2  99
#3       3 101

This of course assumes that your dataframe will always have the grouping column named 'by'. I realize that these data are just fictional, but avoiding 'by' as a column name might be a good idea, because by is also a function in R and code that uses both can get a bit confusing to read.
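
If the grouping column should not be hard-coded, one possible sketch is to pass its name as a string. This assumes dplyr 1.0 or later, where across() and all_of() can be used inside group_by. The function name count_by and the argument group_col are hypothetical and chosen only for this example.

```r
library(dplyr)

# Example data from the question, stored as df
group.count <- c(101, 99, 4)
df <- data.frame(
  by = rep(3:1, group.count),
  y  = rep(letters[1:3], group.count)
)

# Hypothetical helper: group_col is the grouping column's name as a string.
# Assumes dplyr >= 1.0 for across() inside group_by().
count_by <- function(df, group_col) {
  df %>%
    group_by(across(all_of(group_col))) %>%  # select the column by its name
    tally()                                  # row count per group, in column n
}

per_group <- count_by(df, "by")
print(per_group)
```

Taking the name as a string keeps the function usable in loops over several column names, at the cost of giving up the unquoted style used elsewhere in this thread.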

jalapic