3

I would like to group my data frame by a variable, summarize another variable, but keep all other associated columns.

In Applying group_by and summarise on data while keeping all the columns' info the accepted answer is to use filter() or slice(), which works fine if the answer exists in the data already (i.e. min, max) but this doesn't work if you would like to use a function that generates a new answer (i.e. sum, mean).

In Applying group_by and summarise(sum) but keep columns with non-relevant conflicting data? the accepted answer is to use all the the columns you would like to keep as part of the grouping variable. But this seems like an ineffective solution if you have many columns you would like to keep. For example, the data I'm working with has 26 additional columns.

The best solution I've come up with is to split-apply-combine. But this seems clunky - surely there must be a solution that can be done in a single pipeline.

Example:

location <- c("A", "A", "B", "B", "C", "C")
date <- c("1", "2", "1", "2", "1", "2")
count <- c(3, 6, 4, 2, 7, 5)
important_1 <- c(1,1,2,2,3,3)
important_30 <- c(4,4,5,5,6,6)

df <- data.frame(location = location, date = date, count = count, important_1 = important_1, important_30 = important_30)

I want to summarize the counts that happened on different dates at the same location. I want to keep all the important (imagine there are 30 instead of 2).

My solution so far:

check <- df %>%
  group_by(location) %>%
  summarise(count = sum(count))

add2 <- df %>%
  select(-count, -date) %>%
  distinct()

results <- merge(check, add2)

Is there a way I could accomplish this in a single pipeline? I'd rather keep it organized and avoid creating new objects if possible.

canderson156
  • 1,045
  • 10
  • 24

2 Answers2

3

We can create a column with mutate and then apply distinct

library(dplyr)
df %>% 
   group_by(location) %>% 
   mutate(count = sum(count)) %>% select(-date) %>% 
   distinct(location,  important_1, important_30, .keep_all = TRUE)

If there are multiple column names, we can also use syms to convert to symbol and evaluate (!!!)

df %>% 
     group_by(location) %>% 
     mutate(count = sum(count)) %>% select(-date) %>% 
    distinct(location, !!! rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • The first solution would work well I think. The second solution wouldn't work well with my actual data because the "important" columns don't actually have the word important in them. I tried to make my example simple, but perhaps I should have made it more similar to the real data – canderson156 Jun 23 '20 at 14:42
  • @canderson156 In that case you can use index to subset the names i.e. `rlang::syms(names(.)[c(1, 2, 3, ...)]` – akrun Jun 23 '20 at 18:39
  • @canderson156 also, perhaps you can do `group_by_at` i.e. `df %>% group_by_at("location", names(.)[4:5)) %>% summarise(count = sum(count))` – akrun Jun 23 '20 at 18:41
-1

You can group_by all the variables that you want to keep and sum count.

library(dplyr)

df %>% 
  group_by(location, important_1, important_30) %>% 
  summarise(count = sum(count))


#  location important_1 important_30 count
#  <chr>          <dbl>        <dbl> <dbl>
#1 A                  1            4     9
#2 B                  2            5     6
#3 C                  3            6    12
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • As I explained in the question, I want to keep 26 columns. It's tedious to write out all of the names, and I was hoping to find an alternative solution. I simplified the example to make the code easier to read, but there are not just 2 columns that I could easily add to group_by – canderson156 Jun 23 '20 at 14:40