
I have a large dataset with 22000 rows and 25 columns. I am trying to group the dataset by one column and take the minimum of another column within each group. The problem is that this gives me only two columns (the grouping column and the column of minimum values), but I need the information from all the other columns for the rows that hold those minimum values. Here is a simple reproducible example:

    data <- data.frame(a = 1:10,
                       b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
                       c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
                       d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

    d <- data %>%
      group_by(b) %>%
      summarise(min_values = min(c))
    d
    b min_values
    1 a        1.2
    2 b        1.7
    3 c        3.1
    4 d        2.2

So I also need the information in columns a and d. However, since there are duplicated values in column c, I cannot simply merge back on the min_values column. I was wondering whether there is a way to keep the other columns' information when using the dplyr package.

I have found some explanation here "dplyr: group_by, subset and summarise" and here "Finding percentage in a sub-group using group_by and summarise", but neither of them addresses my problem.

Momeneh Foroutan
  • Exactly how do you propose the resulting data.frame would look? How would the other data look when compressed into a single row? – r2evans May 04 '15 at 07:11

4 Answers


You can use group_by without summarize:

data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%
  ungroup()
bergant
  • Thank you so much Bergant, the thing is that your method gives me all the rows... but it is important for me to know for example the min value is related to the number 4 in col "a". Docendo's answer below is exactly what I needed. Thanks anyway for your time on answering this :-) – Momeneh Foroutan May 04 '15 at 10:03
  • this answer. my 'duh' moment of the week. – Brian D Mar 15 '21 at 18:41
  • So simple, yet so powerful. – Karol Daniluk Jun 17 '21 at 10:11
  • @bergant this didn't work for me, the result still shows only 2 columns after I summarise, even though I included ungroup(). – Katya Apr 28 '22 at 13:57
  • "ungroup()" should be replaced with slice(1), this will reduce the rows to one per group & can also work with multiple summary columns – Aaron C Jan 11 '23 at 01:04
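Following up on the last comment, here is a minimal sketch (an addition, not part of the original answer) combining the answer's mutate() with slice(1) so that only one row per group remains, while the group minimum is carried along in a new column:

```r
library(dplyr)

data <- data.frame(a = 1:10,
                   b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
                   c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
                   d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%  # per-group minimum, all rows still present
  slice(1) %>%                     # then keep only the first row of each group
  ungroup()
```

Note that slice(1) keeps each group's first row, which is not necessarily the row where the minimum occurs; it simply reduces the output to one row per group with the minimum attached.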

Here are two options using a) filter and b) slice from dplyr. In this case there are no duplicated minimum values in column c for any of the groups, so the results of a) and b) are the same. If there were duplicated minima, approach a) would return all tied minimum rows per group, while b) would return only one (the first) per group.

a)

> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

Or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

b)

> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med
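To illustrate the difference described above, consider a small hypothetical variant of the question's data (not from the original post) where group "a" has a tied minimum:

```r
library(dplyr)

# Hypothetical data: in group "a", the minimum c = 1.2 appears twice
data2 <- data.frame(a = 1:4,
                    b = c("a", "a", "a", "b"),
                    c = c(1.2, 1.2, 2.4, 1.7))

data2 %>% group_by(b) %>% filter(c == min(c))  # a) 3 rows: both ties in "a", one row for "b"
data2 %>% group_by(b) %>% slice(which.min(c))  # b) 2 rows: only the first tie in "a"
```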
talat
  • Thanks a million Docendo for the answer. This is exactly what I was looking for :-) – Momeneh Foroutan May 04 '15 at 10:04
  • Exactly what I needed! And I discovered the function slice as a bonus, thx! – Aditya Aug 18 '15 at 08:04
  • What if you are trying to use summarize to get information that is not contained in the original data, and therefore cannot be "filtered"? for example, sum or mean? – canderson156 Jun 22 '20 at 20:14
  • Late to the party, but you can still filter by the return of functions. For example, you can do `df %>% group_by(x) %>% filter(n() > 10)` to filter for groups with more than ten observations, without having assigned `n()` to any previous column. – mhovd Nov 30 '21 at 10:31

Using sqldf:

library(sqldf)

# Two options:
sqldf('SELECT * FROM data GROUP BY b HAVING min(c)')
sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b')

Output (of the first query; the second names the minimum column min):

   a b   c     d
1  1 a 1.2 small
2  4 b 1.7  larg
3  6 c 3.1   med
4 10 d 2.2   med
mpalanco

With dplyr 1.1.0, you can use .by in mutate, summarize, filter and slice for per-operation (temporary) grouping; the slice_*() helpers take by instead of .by. With mutate, all rows and columns are kept:

data %>% 
  mutate(min_values = min(c), .by = b)

With slice_min or filter, only the rows holding each group's minimum are kept, and all columns are retained:

data %>% 
  slice_min(c, by = b)

data %>% 
  filter(c == min(c), .by = b)
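One caveat worth noting (an addition, not part of the original answer): slice_min() keeps all tied minima by default. A sketch using hypothetical tied data, with with_ties = FALSE forcing exactly one row per group:

```r
library(dplyr)

# Hypothetical data: group "a" has a tied minimum (c = 1.2 appears twice)
ties <- data.frame(a = 1:4,
                   b = c("a", "a", "a", "b"),
                   c = c(1.2, 1.2, 2.4, 1.7))

ties %>% slice_min(c, by = b)                     # keeps both tied rows in "a"
ties %>% slice_min(c, by = b, with_ties = FALSE)  # keeps exactly one row per group
```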
Maël