
I am trying to use a data.frame twice in a dplyr chain. Here is a simple example that gives an error:

df <- data.frame(Value=1:10,Type=rep(c("A","B"),5))

df %>% 
  group_by(Type) %>% 
  summarize(X=n())  %>% 
  mutate(df %>%filter(Value>2) %>%  
  group_by(Type) %>%  
  summarize(Y=sum(Value)))

Error: cannot handle

The idea is that a data.frame is first created with two columns: Value, which is just some data, and Type, which indicates which group each value belongs to.

I then try to use summarize to get the number of rows in each group, and then mutate, using the original data.frame again, to get the sum of the values after the data has been filtered. However, I get "Error: cannot handle". Any ideas what is happening here?

Desired Output:

Type X Y
  A  5 24
  B  5 28
John Paul

2 Answers


You could try the following

df %>% 
  group_by(Type) %>% 
  summarise(X = n(), Y = sum(Value[Value > 2]))

# Source: local data frame [2 x 3]
# 
#   Type X  Y
# 1    A 5 24
# 2    B 5 28

The idea is to filter only Value by the desired condition, instead of the whole data set.
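If you really do want to use the data frame twice, as in the question, one option is to build the two summaries separately and join them on Type. This is only a minimal sketch of that alternative; the intermediate names `counts`, `sums`, and `result` are made up for illustration.

```r
library(dplyr)

df <- data.frame(Value = 1:10, Type = rep(c("A", "B"), 5))

# First pass over df: number of rows per group
counts <- df %>%
  group_by(Type) %>%
  summarise(X = n())

# Second pass over df: sum of the filtered values per group
sums <- df %>%
  filter(Value > 2) %>%
  group_by(Type) %>%
  summarise(Y = sum(Value))

# Combine the two summaries by the grouping column
result <- left_join(counts, sums, by = "Type")
result
```

One difference to keep in mind: if every row of a group is removed by the filter, the join leaves Y as NA for that group, whereas `sum(Value[Value > 2])` returns 0.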


And a bonus solution

library(data.table)
setDT(df)[, .(X = .N, Y = sum(Value[Value > 2])), by = Type]
#    Type X  Y
# 1:    A 5 24
# 2:    B 5 28

I was going to suggest this to @nongkrong, but he deleted his post. With base R we could also do

aggregate(Value ~ Type, df, function(x) c(length(x), sum(x[x>2])))
#   Type Value.1 Value.2
# 1    A       5      24
# 2    B       5      28
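Note that when the function passed to `aggregate` returns more than one value, the result stores them in a single matrix column called Value rather than in two separate columns. A possible way to flatten it, sketched here with the made-up names `res` and `flat`, is `do.call(data.frame, ...)`:

```r
df <- data.frame(Value = 1:10, Type = rep(c("A", "B"), 5))

res <- aggregate(Value ~ Type, df, function(x) c(length(x), sum(x[x > 2])))
ncol(res)   # 2: Type plus one matrix column holding both results

# Split the matrix column into ordinary columns, then rename them
flat <- do.call(data.frame, res)
names(flat) <- c("Type", "X", "Y")
flat
```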
David Arenburg
  • Thanks. I went the `ifelse` route just because my actual conditions were more complex and I found the code easier to understand, but your multiple solutions are very informative. – John Paul Aug 13 '15 at 19:32
  • That's fine, though `ifelse` is usually not very efficient for big data sets. Most of the problem can be solved without it (from my experience at least) – David Arenburg Aug 13 '15 at 19:33
  • Jumping on David's comment, I totally agree. For big datasets, I prefer `data.table`, although even there I run into some long vector issues with reshaping. – bjoseph Aug 13 '15 at 19:38
  • @bjoseph it's not about `data.table` vs `dplyr` rather just `ifelse`. See [here](http://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow) – David Arenburg Aug 13 '15 at 19:39
  • @bjoseph You folks convinced me - I rewrote and removed the `ifelse` , I should learn not to be lazy – John Paul Aug 13 '15 at 19:42

This is also pretty easy to do with ifelse()

df %>% group_by(Type) %>% summarize(X=n(),y=sum( ifelse(Value>2, Value, 0 )))

outputs:

Source: local data frame [2 x 3]

  Type X  y
1    A 5 24
2    B 5 28
bjoseph
  • This worked great - my actual problem had somewhat more complicated conditions and this was easier to adapt for me. – John Paul Aug 13 '15 at 19:30
  • I find `dplyr` syntax confusing too, and I make extensive use of `ifelse` inside `dplyr` statements. It works in a surprising number of cases. – bjoseph Aug 13 '15 at 19:32