Can you use a data.frame twice in a dplyr chain? dplyr says "Error: cannot handle"

Question

I am trying to use a data.frame twice in a dplyr chain. Here is a simple example that gives an error:

library(dplyr)

df <- data.frame(Value = 1:10, Type = rep(c("A", "B"), 5))

df %>%
  group_by(Type) %>%
  summarize(X = n()) %>%
  mutate(df %>%
           filter(Value > 2) %>%
           group_by(Type) %>%
           summarize(Y = sum(Value)))

Error: cannot handle

The idea is to first create a data.frame with two columns: Value, which is just some data, and Type, which indicates which group each value belongs to.

I then use summarize to get the number of observations in each group, and then mutate, using the data.frame a second time, to get the sum of the values after the data has been filtered. However, I get "Error: cannot handle". Any ideas what is happening here?

Desired Output:

Type X Y
  A  5 24
  B  5 28

@DavidArenburg Desired output added. At least dplyr is being honest when it is overwhelmed. — John Paul, Aug 13 '15 at 19:10

David Arenburg · Accepted Answer · 2015-08-13T19:29:27.803

6

You could try the following:

df %>% 
  group_by(Type) %>% 
  summarise(X = n(), Y = sum(Value[Value > 2]))

# Source: local data frame [2 x 3]
# 
#   Type X  Y
# 1    A 5 24
# 2    B 5 28

The idea is to filter only Value by the desired condition, instead of filtering the whole data set.
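If you really do want to use the data frame twice, one alternative (not part of the original answer, just a minimal sketch) is to build each summary separately and then join them on Type. The names `counts`, `sums`, and `result` are made up for illustration.

```r
library(dplyr)

df <- data.frame(Value = 1:10, Type = rep(c("A", "B"), 5))

# First use of df: number of rows per group
counts <- df %>%
  group_by(Type) %>%
  summarise(X = n())

# Second use of df: sum of the values that pass the filter, per group
sums <- df %>%
  filter(Value > 2) %>%
  group_by(Type) %>%
  summarise(Y = sum(Value))

# Combine the two summaries by the grouping column
result <- left_join(counts, sums, by = "Type")
result
```

Note that a group with no rows left after the filter would get `NA` for Y here, whereas `sum(Value[Value > 2])` would give 0 for that group.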

And a bonus solution:

library(data.table)
setDT(df)[, .(X = .N, Y = sum(Value[Value > 2])), by = Type]
#    Type X  Y
# 1:    A 5 24
# 2:    B 5 28

I was going to suggest that to @nongkrong, but he deleted his post. With base R we could also do:

aggregate(Value ~ Type, df, function(x) c(length(x), sum(x[x>2])))
#   Type Value.1 Value.2
# 1    A       5      24
# 2    B       5      28
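One thing to watch with the `aggregate` version: because the function returns two values, the result holds them in a single matrix column named Value, even though it prints as Value.1 and Value.2. If separate columns are needed, a small sketch like the following can flatten it (the names `res` and `flat`, and the renaming to X and Y, are just for illustration).

```r
df <- data.frame(Value = 1:10, Type = rep(c("A", "B"), 5))

res <- aggregate(Value ~ Type, df, function(x) c(length(x), sum(x[x > 2])))

# res has two columns: Type, and a matrix column Value holding both results
is.matrix(res$Value)

# Rebuild as an ordinary data frame so the matrix is split into columns
flat <- do.call(data.frame, res)
names(flat) <- c("Type", "X", "Y")
flat
```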

edited Aug 13 '15 at 19:29

answered Aug 13 '15 at 19:13

David Arenburg


Thanks. I went the `ifelse` route just because my actual conditions were more complex and I found the code easier to understand, but your multiple solutions are very informative. – John Paul Aug 13 '15 at 19:32
That's fine, though `ifelse` is usually not very efficient for big data sets. Most problems can be solved without it (in my experience, at least) – David Arenburg Aug 13 '15 at 19:33
Jumping on David's comment, I totally agree. For big datasets, I prefer `data.table`, although even there I run into some long vector issues with reshaping. – bjoseph Aug 13 '15 at 19:38
@bjoseph it's not about `data.table` vs `dplyr` rather just `ifelse`. See [here](http://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow) – David Arenburg Aug 13 '15 at 19:39
@bjoseph You folks convinced me - I rewrote and removed the `ifelse` , I should learn not to be lazy – John Paul Aug 13 '15 at 19:42

score 3 · Answer 2 · answered Aug 13 '15 at 19:14


This is also pretty easy to do with ifelse():

df %>%
  group_by(Type) %>%
  summarize(X = n(), y = sum(ifelse(Value > 2, Value, 0)))

outputs:

Source: local data frame [2 x 3]

  Type X  y
1    A 5 24
2    B 5 28


bjoseph


This worked great - my actual problem had somewhat more complicated conditions and this was easier to adapt for me. – John Paul Aug 13 '15 at 19:30

I find `dplyr` syntax confusing too, and I make extensive use of `ifelse` inside `dplyr` statements. It works in a surprising number of cases. – bjoseph Aug 13 '15 at 19:32
