R - dplyr Summarize and Retain Other Columns

Question

I am grouping data and then summarizing it, but would also like to retain another column. I do not need to do any evaluations of that column's content as it will always be the same as the group_by column. I can add it to the group_by statement but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks

TDAAtest <- data.frame(State=sample(state.abb,1000,replace=TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State,state.abb)]


TDAA.states <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  summarize(n=n()) %>%
  ungroup() %>%
  arrange(State)

Which column you want to `summarise`? Based on the code, you are doing it by both the columns — akrun, Aug 23 '16 at 04:03
It's probably the best just grouping by both. The alternative is `summarize(State.Full.Name = unique(State.Full.Name), n=n())`, which is less concise. — alistaire, Aug 23 '16 at 04:09
@akrun I clarified. I had left in my not so "right" approach. — atclaus, Aug 23 '16 at 04:09
Also, `count(...)` is equivalent to `group_by(...) %>% summarise(n = n())` — alistaire, Aug 23 '16 at 04:11

akrun · Answer 1 · 2016-08-23T04:10:44.377

42

Perhaps we need

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     summarise(State.Full.Name = first(State.Full.Name), n = n())

Or use mutate to create the column and then do the distinct

TDAAtest %>% f
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     mutate(n= n()) %>% 
     distinct(State, .keep_all=TRUE)

edited Aug 23 '16 at 04:10

answered Aug 23 '16 at 04:05

akrun

874,273
37
540
662

score 9 · Answer 2 · answered Jan 01 '22 at 08:34

To retain all columns, you can include across() as a summarize argument, as explained in the documentation for dplyr::do().

by_cyl <- head(mtcars) %>%
  group_by(cyl)
by_cyl %>%
  summarise(m_mpg = mean(mpg), across())

    cyl m_mpg   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4  22.8  22.8   108    93  3.85  2.32  18.6     1     1     4     1
2     6  20.4  21     160   110  3.9   2.62  16.5     0     1     4     4
3     6  20.4  21     160   110  3.9   2.88  17.0     0     1     4     4
4     6  20.4  21.4   258   110  3.08  3.22  19.4     1     0     3     1
5     6  20.4  18.1   225   105  2.76  3.46  20.2     1     0     3     1
6     8  18.7  18.7   360   175  3.15  3.44  17.0     0     0     3     2

To retain only a subset of unaltered columns, you can select them within across using tidyselect semantics.

What would be the specific syntax to retain only one column? by_cyl %>% summarise(m_mpg = mean(mpg), across(select(hp)) doesn't work — eggrandio, Jun 07 '22 at 12:17
You don't need `select()` within `across` (`select` already has tidyselect semantics); `by_cyl %>% summarise(m_mpg = mean(mpg), across(hp))` should work — Extrapolator, Jun 08 '22 at 13:07

score 4 · Answer 3 · answered Jun 23 '21 at 05:42

I believe there are more accurate answers than the accepted answer specially when you don't have unique data for other columns in each group (e.g. max or min or top n items based on one particular column ).

Although the accepted answer works for this question, for instance, you would like to find the county with the max population for each state. (You need to have county and population columns).

We have the following options:

1. dplyr version

From this link, you have three extra operations (mutate, ungroup and filter) to achieve that:

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     mutate(maxPopulation = max(Population)) %>% 
     ungroup() %>%
     filter(maxPopulation == Population)

2. Function version

This one gives you as much flexibility as you want and you can apply any kind of operation to each group:

maxFUN = function(x) {
  # order population in a descending order
  x = x[with(x, order(-Population)), ]
  x[1, ]
}

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>%
     do(maxFUN(.))

This one is highly recommended for more complex operations. For instance, you can return top n (topN) counties per state by having x[1:topN] for the returned dataframe in maxFUN.

R - dplyr Summarize and Retain Other Columns

3 Answers3

Linked