44

I am grouping data and then summarizing it, but would also like to retain another column. I do not need to do any evaluations of that column's content as it will always be the same as the group_by column. I can add it to the group_by statement but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks

TDAAtest <- data.frame(State=sample(state.abb,1000,replace=TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State,state.abb)]


TDAA.states <- TDAAtest %>%
  filter(!is.na(State)) %>%
  group_by(State) %>%
  summarize(n=n()) %>%
  ungroup() %>%
  arrange(State)
atclaus
  • 1,046
  • 1
  • 9
  • 12

3 Answers3

42

Perhaps we need

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     summarise(State.Full.Name = first(State.Full.Name), n = n())

Or use mutate to create the column and then do the distinct

TDAAtest %>% f
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     mutate(n= n()) %>% 
     distinct(State, .keep_all=TRUE)
akrun
  • 874,273
  • 37
  • 540
  • 662
9

To retain all columns, you can include across() as a summarize argument, as explained in the documentation for dplyr::do().

by_cyl <- head(mtcars) %>%
  group_by(cyl)
by_cyl %>%
  summarise(m_mpg = mean(mpg), across())

    cyl m_mpg   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4  22.8  22.8   108    93  3.85  2.32  18.6     1     1     4     1
2     6  20.4  21     160   110  3.9   2.62  16.5     0     1     4     4
3     6  20.4  21     160   110  3.9   2.88  17.0     0     1     4     4
4     6  20.4  21.4   258   110  3.08  3.22  19.4     1     0     3     1
5     6  20.4  18.1   225   105  2.76  3.46  20.2     1     0     3     1
6     8  18.7  18.7   360   175  3.15  3.44  17.0     0     0     3     2

To retain only a subset of unaltered columns, you can select them within across using tidyselect semantics.

Extrapolator
  • 343
  • 3
  • 7
  • What would be the specific syntax to retain only one column? by_cyl %>% summarise(m_mpg = mean(mpg), across(select(hp)) doesn't work – eggrandio Jun 07 '22 at 12:17
  • You don't need `select()` within `across` (`select` already has tidyselect semantics); `by_cyl %>% summarise(m_mpg = mean(mpg), across(hp))` should work – Extrapolator Jun 08 '22 at 13:07
4

I believe there are more accurate answers than the accepted answer specially when you don't have unique data for other columns in each group (e.g. max or min or top n items based on one particular column ).

Although the accepted answer works for this question, for instance, you would like to find the county with the max population for each state. (You need to have county and population columns).

We have the following options:

1. dplyr version

From this link, you have three extra operations (mutate, ungroup and filter) to achieve that:

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>% 
     mutate(maxPopulation = max(Population)) %>% 
     ungroup() %>%
     filter(maxPopulation == Population)

2. Function version

This one gives you as much flexibility as you want and you can apply any kind of operation to each group:

maxFUN = function(x) {
  # order population in a descending order
  x = x[with(x, order(-Population)), ]
  x[1, ]
}

TDAAtest %>% 
     filter(!is.na(State)) %>%
     group_by(State) %>%
     do(maxFUN(.)) 

This one is highly recommended for more complex operations. For instance, you can return top n (topN) counties per state by having x[1:topN] for the returned dataframe in maxFUN.

Habib Karbasian
  • 556
  • 8
  • 18