
I have a large dataset with 22000 rows and 25 columns. I am trying to group the dataset by one column and take the minimum of another column within each group. The problem is that this gives me only two columns (the grouping column and the column of minimum values), but I need the information from all the other columns for the rows that hold those minimum values. Here is a simple reproducible example:

    data <- data.frame(a = 1:10,
                       b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
                       c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
                       d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

    d <- data %>%
      group_by(b) %>%
      summarise(min_values = min(c))
    d
    b min_values
    1 a        1.2
    2 b        1.7
    3 c        3.1
    4 d        2.2

So I also need the information in columns a and d. However, since there are duplicated values in column c, I cannot simply merge back on the min_values column. I was wondering whether there is a way to keep the other columns' information when using the dplyr package.

I have found some explanation here "dplyr: group_by, subset and summarise" and here "Finding percentage in a sub-group using group_by and summarise", but neither of them addresses my problem.

Momeneh Foroutan
  • Exactly how do you propose the resulting data.frame would look? How would the other data look when compressed into a single row? – r2evans May 04 '15 at 07:11

4 Answers


You can use group_by without summarize:

data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%
  ungroup()
bergant
  • Thank you so much Bergant, the thing is that your method gives me all the rows... but it is important for me to know for example the min value is related to the number 4 in col "a". Docendo's answer below is exactly what I needed. Thanks anyway for your time on answering this :-) – Momeneh Foroutan May 04 '15 at 10:03
  • this answer. my 'duh' moment of the week. – Brian D Mar 15 '21 at 18:41
  • So simple, yet so powerful. – Karol Daniluk Jun 17 '21 at 10:11
  • @bergant this didn't work for me, the result still shows only 2 columns after I summarise, even though I included ungroup(). – Katya Apr 28 '22 at 13:57
  • "ungroup()" should be replaced with slice(1), this will reduce the rows to one per group & can also work with multiple summary columns – Aaron C Jan 11 '23 at 01:04
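Following up on the last comment, here is a minimal sketch (an addition, not part of the original answer) combining the answer's mutate() with slice(1) so that only one row per group remains, while the group minimum is carried along in a new column:

```r
library(dplyr)

data <- data.frame(a = 1:10,
                   b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
                   c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
                   d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%  # per-group minimum, all rows still present
  slice(1) %>%                     # then keep only the first row of each group
  ungroup()
```

Note that slice(1) keeps each group's first row, which is not necessarily the row where the minimum occurs; it simply reduces the output to one row per group with the minimum attached.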

Here are two options using a) filter and b) slice from dplyr. In this case there are no duplicated minimum values in column c for any of the groups, so the results of a) and b) are the same. If there were duplicated minima, approach a) would return all tied minimum rows per group, while b) would return only one (the first) per group.

a)

> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

Or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

b)

> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med
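To illustrate the difference described above, consider a small hypothetical variant of the question's data (not from the original post) where group "a" has a tied minimum:

```r
library(dplyr)

# Hypothetical data: in group "a", the minimum c = 1.2 appears twice
data2 <- data.frame(a = 1:4,
                    b = c("a", "a", "a", "b"),
                    c = c(1.2, 1.2, 2.4, 1.7))

data2 %>% group_by(b) %>% filter(c == min(c))  # a) 3 rows: both ties in "a", one row for "b"
data2 %>% group_by(b) %>% slice(which.min(c))  # b) 2 rows: only the first tie in "a"
```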
talat
  • Thanks a million Docendo for the answer. This is exactly what I was looking for :-) – Momeneh Foroutan May 04 '15 at 10:04
  • Exactly what I needed! And I discovered the function slice as a bonus, thx! – Aditya Aug 18 '15 at 08:04
  • What if you are trying to use summarize to get information that is not contained in the original data, and therefore cannot be "filtered"? for example, sum or mean? – canderson156 Jun 22 '20 at 20:14
  • Late to the party, but you can still filter by the return of functions. For example, you can do `df %>% group_by(x) %>% filter(n() > 10)` to filter for groups with more than ten observations, without having assigned `n()` to any previous column. – mhovd Nov 30 '21 at 10:31

Using sqldf:

library(sqldf)

# Two options:
sqldf('SELECT * FROM data GROUP BY b HAVING min(c)')
sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b')

Output (of the first query; the second names the minimum column min):

   a b   c     d
1  1 a 1.2 small
2  4 b 1.7  larg
3  6 c 3.1   med
4 10 d 2.2   med
mpalanco

With dplyr 1.1.0, you can use .by in mutate, summarize, filter and slice for per-operation (temporary) grouping; the slice_*() helpers take by instead of .by. With mutate, all rows and columns are kept:

data %>% 
  mutate(min_values = min(c), .by = b)

With slice_min or filter, only the rows holding each group's minimum are kept, and all columns are retained:

data %>% 
  slice_min(c, by = b)

data %>% 
  filter(c == min(c), .by = b)
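One caveat worth noting (an addition, not part of the original answer): slice_min() keeps all tied minima by default. A sketch using hypothetical tied data, with with_ties = FALSE forcing exactly one row per group:

```r
library(dplyr)

# Hypothetical data: group "a" has a tied minimum (c = 1.2 appears twice)
ties <- data.frame(a = 1:4,
                   b = c("a", "a", "a", "b"),
                   c = c(1.2, 1.2, 2.4, 1.7))

ties %>% slice_min(c, by = b)                     # keeps both tied rows in "a"
ties %>% slice_min(c, by = b, with_ties = FALSE)  # keeps exactly one row per group
```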
Maël