I've been using dplyr in my workflows for quite some time, but I'm coming to the realization that perhaps I don't understand the group_by function. Can someone explain whether there is a better approach to accomplishing my goal?
My initial understanding was that by introducing group_by() before operations such as mutate, the mutate function would operate separately within each group specified by group_by(), restarting its computation for each Condition.
This doesn't seem to be true, so I've had to resort to splitting my data into a list by the Condition I had previously passed to group_by(), applying my intended functions to each element with lapply, and then collapsing the list back into a single table.
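This is a minimal sketch of the split/apply/combine workaround described above. The tibble here is a hypothetical stand-in with the same column names (Condition, TVC) as my real data:

```r
library(dplyr)

# Stand-in for the real data (hypothetical values, same column names)
data <- tibble(
  Condition = c("1A", "1A", "1B", "1B"),
  TVC       = c(5, 7, 3, 4)
)

result <- data %>%
  split(.$Condition) %>%                                 # one data frame per Condition
  lapply(function(d) mutate(d, summation = cumsum(TVC))) %>%
  bind_rows()                                            # collapse the list back into one table

result$summation  # 5, 12, 3, 7 -- cumsum restarts for each Condition
```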
Example below. My intention was to perform a cumsum operation on column TVC for each Condition. However, you'll see that the summation column is a straightforward cumsum across the entire TVC column, without restarting at each group specified by the Condition column.
> (data %>% filter(`Elapsed Time (days)`<=8) %>%
+ arrange(Condition,`Elapsed Time (days)`) %>%
+ select(Condition, `Elapsed Time (days)`, TVC) %>%
+ filter(!is.na(TVC)) %>%
+ group_by(Condition) %>%
+ mutate(summation =cumsum(TVC)))
# A tibble: 94 x 4
# Groups: Condition [24]
Condition `Elapsed Time (days)` TVC summation
<chr> <drtn> <dbl> <dbl>
1 1A 0.000000 secs 15400921. 15400921.
2 1A 4.948611 secs 11877256. 27278177
3 1A 6.027778 secs 11669731. 38947908.
4 1A 6.949306 secs 11908853. 50856761.
5 1B 0.000000 secs 14514263. 65371024.
6 1B 4.948611 secs 8829356. 74200380.
7 1B 6.027778 secs 12068221. 86268601.
8 1B 6.949306 secs 10111424. 96380026.
9 1C 0.000000 secs 15400921. 111780946.
10 1C 4.948611 secs 8680060 120461006.