Applying group_by and summarise(sum) but keep a large number of additional columns

Question

I would like to group my data frame by a variable, summarize another variable, but keep all other associated columns.

In Applying group_by and summarise on data while keeping all the columns' info the accepted answer is to use filter() or slice(), which works fine if the answer exists in the data already (i.e. min, max) but this doesn't work if you would like to use a function that generates a new answer (i.e. sum, mean).

In Applying group_by and summarise(sum) but keep columns with non-relevant conflicting data? the accepted answer is to use all the the columns you would like to keep as part of the grouping variable. But this seems like an ineffective solution if you have many columns you would like to keep. For example, the data I'm working with has 26 additional columns.

The best solution I've come up with is to split-apply-combine. But this seems clunky - surely there must be a solution that can be done in a single pipeline.

Example:

location <- c("A", "A", "B", "B", "C", "C")
date <- c("1", "2", "1", "2", "1", "2")
count <- c(3, 6, 4, 2, 7, 5)
important_1 <- c(1,1,2,2,3,3)
important_30 <- c(4,4,5,5,6,6)

df <- data.frame(location = location, date = date, count = count, important_1 = important_1, important_30 = important_30)

I want to summarize the counts that happened on different dates at the same location. I want to keep all the important (imagine there are 30 instead of 2).

My solution so far:

check <- df %>%
  group_by(location) %>%
  summarise(count = sum(count))

add2 <- df %>%
  select(-count, -date) %>%
  distinct()

results <- merge(check, add2)

Is there a way I could accomplish this in a single pipeline? I'd rather keep it organized and avoid creating new objects if possible.

1

`important_2` is not defined – akrun Jun 22 '20 at 21:10

score 3 · Accepted Answer · answered Jun 22 '20 at 21:13

3

We can create a column with mutate and then apply distinct

library(dplyr)
df %>% 
   group_by(location) %>% 
   mutate(count = sum(count)) %>% select(-date) %>% 
   distinct(location,  important_1, important_30, .keep_all = TRUE)

If there are multiple column names, we can also use syms to convert to symbol and evaluate (!!!)

df %>% 
     group_by(location) %>% 
     mutate(count = sum(count)) %>% select(-date) %>% 
    distinct(location, !!! rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE)

answered Jun 22 '20 at 21:13

akrun

874,273
37
540
662

The first solution would work well I think. The second solution wouldn't work well with my actual data because the "important" columns don't actually have the word important in them. I tried to make my example simple, but perhaps I should have made it more similar to the real data – canderson156 Jun 23 '20 at 14:42
@canderson156 In that case you can use index to subset the names i.e. `rlang::syms(names(.)[c(1, 2, 3, ...)]` – akrun Jun 23 '20 at 18:39
@canderson156 also, perhaps you can do `group_by_at` i.e. `df %>% group_by_at("location", names(.)[4:5)) %>% summarise(count = sum(count))` – akrun Jun 23 '20 at 18:41

score -1 · Answer 2 · answered Jun 23 '20 at 07:06

-1

You can group_by all the variables that you want to keep and sum count.

library(dplyr)

df %>% 
  group_by(location, important_1, important_30) %>% 
  summarise(count = sum(count))


#  location important_1 important_30 count
#  <chr>          <dbl>        <dbl> <dbl>
#1 A                  1            4     9
#2 B                  2            5     6
#3 C                  3            6    12

answered Jun 23 '20 at 07:06

Ronak Shah

377,200
20
156
213

As I explained in the question, I want to keep 26 columns. It's tedious to write out all of the names, and I was hoping to find an alternative solution. I simplified the example to make the code easier to read, but there are not just 2 columns that I could easily add to group_by – canderson156 Jun 23 '20 at 14:40

Applying group_by and summarise(sum) but keep a large number of additional columns

2 Answers2