Summarize all group values and a conditional subset in the same call

Question

I'll illustrate my question with an example.

Sample data:

 df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"), B =     c(1, 5, 7, 23, 54, 202))

df
  ID   A   B
1  1 foo   1
2  1 bar   5
3  2 foo   7
4  2 foo  23
5  3 bar  54
6  5 bar 202

What I want to do is to summarize, by ID, the sum of B and the sum of B when A is "foo". I can do this in a couple steps like:

require(magrittr)
require(dplyr)

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B))

df2 <- df %>%
  filter(A == "foo") %>%
  group_by(ID) %>%
  summarize(sumBfoo = sum(B))

left_join(df1, df2)

  ID sumB sumBfoo
1  1    6       1
2  2   30      30
3  3   54      NA
4  5  202      NA

However, I'm looking for a more elegant/faster way, as I'm dealing with 10gb+ of out-of-memory data in sqlite.

require(sqldf)
my_db <- src_sqlite("my_db.sqlite3", create = T)
df_sqlite <- copy_to(my_db, df)

I thought of using mutate to define a new Bfoo column:

df_sqlite %>%
  mutate(Bfoo = ifelse(A=="foo", B, 0))

Unfortunately, this doesn't work on the database end of things.

Error in sqliteExecStatement(conn, statement, ...) : 
  RS-DBI driver: (error in statement: no such function: IFELSE)

I believe the problem is that you are trying to mix character and numeric in Bfoo which is not possible — talat, May 07 '14 at 21:38
@beginneR it's all numeric and the code runs fine locally... — kevinykuo, May 07 '14 at 21:40
What versions of R and dplry are you using? Your code ran fine for me locally as well (after I changed %>% to %.%). — MrFlick, May 07 '14 at 22:06
@MrFlick yeah it's supposed to run "locally" (by that I meant on a data.frame and not in the sqlite database). I'm trying to figure out how to best run it in sqlite, which doesn't recognize "ifelse" — kevinykuo, May 07 '14 at 22:08
@pearpies I don't understand. Is this running in R? or is this running in sqllite? How is it getting to the database? — MrFlick, May 07 '14 at 22:09
Try using `if(A=="foo") B else 0` - dplyr will try to convert that to a SQL case statement, which might work for you. — hadley, May 07 '14 at 22:39
In case anyone is looking to summarize based on mathematical conditions (instead of matching strings): https://stackoverflow.com/questions/59199273/summarize-with-mathematical-conditions-in-dplyr/59199366#59199366 — TheSciGuy, Dec 05 '19 at 17:29

eipi10 · Answer 1 · 2020-12-10T02:51:05.913

130

You can do both sums in a single dplyr statement:

df1 <- df %>%
  group_by(ID) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(B[A=="foo"]))

And here is a data.table version:

library(data.table)

dt = setDT(df) 

dt1 = dt[ , .(sumB = sum(B),
              sumBfoo = sum(B[A=="foo"])), 
          by = ID]

dt1

   ID sumB sumBfoo
1:  1    6       1
2:  2   30      30
3:  3   54       0
4:  5  202       0

edited Dec 10 '20 at 02:51

answered May 07 '14 at 22:37

eipi10

91,525
24
209
285

this would work if I could fit everything into memory. i'll definitely keep it in mind. problem is sql doesn't recognize ``[``. sorry i wasn't clear in the question! – kevinykuo May 07 '14 at 22:54
1

The dplyr database vignette might be helpful for future reference (in case you haven't already seen it): http://cran.rstudio.com/web/packages/dplyr/vignettes/databases.html – eipi10 May 07 '14 at 22:57
1

How could I miss that in the documentation?! Imho one of dplyr's most useful functionalities for dealing with data, especially when working with tidy data. – r0bert Nov 18 '15 at 17:26
7

great answer! Have not come across this kind of subsetting before : `df %>% summarize(function(colname[conditional_colname]))` – vagabond Nov 01 '16 at 19:12
Wow, never knew u could subset like that. Thank you very much! – ktyagi Jun 09 '20 at 17:36
@eipi10 I'm aware of how old this answer is, but do you happen to know the `data.table` alternative for the conditional `A == "foo"` that you used for `sumBfoo`? – PLY Dec 08 '20 at 13:12

score 38 · Accepted Answer · answered May 07 '14 at 22:56

38

Writing up @hadley's comment as an answer

df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if(A=="foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect

answered May 07 '14 at 22:56

kevinykuo

4,600
5
23
31

1

what is the function of `dplyr::collect` here? – Elliot Mar 28 '16 at 03:12
1

`collect` forces the evaluation of the verbs prior to it – kevinykuo Apr 05 '16 at 03:27

score 16 · Answer 3 · answered Dec 08 '14 at 13:11

If you want to do counting instead of summarizing, then the answer is somewhat different. The change in code is small, especially in the conditional counting part.

df1 <- df %>%
    group_by(ID) %>%
    summarize(countB = n(),
              countBfoo = sum(A=="foo"))

df1
Source: local data frame [4 x 3]

  ID countB countBfoo
1  1      2         1
2  2      2         2
3  3      1         0
4  5      1         0

score -1 · Answer 4 · answered Apr 21 '19 at 09:36

-1

If you wanted to count the rows, instead of summing them, can you pass a variable to the function:

    df1 <- df %>%
group_by(ID) %>%
summarize(RowCountB = n(),
          RowCountBfoo = n(A=="foo"))

I get an error both with n() and nrow().

answered Apr 21 '19 at 09:36

user3116297

51
9

Is this an answer or another question? – Rekamanon Jul 07 '20 at 20:52

Summarize all group values and a conditional subset in the same call

4 Answers4

Linked

Related