How to choose between two replicated quantities in R

Question

This is a simplified example. I have a data frame with two variables like this:

a <- c(1,1,1,2,2,2,3,3,6,7,4,5,5,8)
b <- c(5,10,4,2,8,4,6,9,12,3,7,4,1,7)
D <- data.frame(a,b)

As you can see, a takes 8 distinct values, but some of them are repeated, so my data frame has 14 observations. I want to create a data frame with 8 observations in which the a values are unique and each b value is the minimum of the b values that share that a, i.e., the result should be like:

Pick your favorite method from the FAQ [How to sum a variable by group](https://stackoverflow.com/q/1660124/903061), and then replace `sum` with `min` to get the minimum instead. — Gregor Thomas, Jul 23 '18 at 16:53

DanY · Answer 1 · 2018-07-26T05:09:16.137

3

Here's how to do it with base R:

#both lines compute the same minima, pick one
#(the first names the result column x, the second keeps the name b)
aggregate(D$b, by = D["a"], FUN = min)
aggregate(b ~ a, data = D, FUN = min)

Here's how to do it with data.table:

library(data.table)
setDT(D)
D[ , .(min(b)), by=a]

Here's how to do it with tidyverse functions:

library(tidyverse) #or just library(dplyr)
# the pipe must end a line, not start one, or R stops parsing after group_by(a)
D %>%
  group_by(a) %>%
  summarize(min(b))

edited Jul 26 '18 at 05:09

answered Jul 23 '18 at 16:52

DanY


1

Using formula notation gives a more readable version of `aggregate`: `aggregate(b~a, data=D, FUN = min)` – Jilber Urbina Jul 23 '18 at 17:04
1

@JilberUrbina - You're probably right. I loathe `aggregate()`. It's clearly a function written without data.frames in mind, yet aggregation is a common thing to do with data stored in a data.frame. – DanY Jul 23 '18 at 17:23

score 2 · Accepted Answer · answered Jul 23 '18 at 17:02

2

Using a base R approach:

> D2  <- D[order(D$a, D$b ), ]
> D2  <- D2[ !duplicated(D2$a), ]
> D2
   a  b
3  1  4
4  2  2
7  3  6
11 4  7
13 5  1
9  6 12
10 7  3
14 8  7
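As a quick check, here is a minimal sketch that reruns the sort-and-deduplicate steps on the question's data and compares the result with `aggregate`. The names `agg` and `same` are introduced here for illustration only.

```r
a <- c(1,1,1,2,2,2,3,3,6,7,4,5,5,8)
b <- c(5,10,4,2,8,4,6,9,12,3,7,4,1,7)
D <- data.frame(a, b)

# Sort by a, then by b, so the smallest b comes first within each value of a
D2 <- D[order(D$a, D$b), ]
# duplicated() flags every row after the first occurrence of each a,
# so negating it keeps only the first (smallest-b) row per group
D2 <- D2[!duplicated(D2$a), ]

# Compare against a plain grouped minimum
agg <- aggregate(b ~ a, data = D, FUN = min)
same <- all(D2$a == agg$a) && all(D2$b == agg$b)
print(same)
```

The original row names (3, 4, 7, ...) are kept by this approach, which can be handy if you need to know which row supplied each minimum.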

answered Jul 23 '18 at 17:02

Jilber Urbina


This solution certainly works for this very particular question, but I would advise readers to look at other solutions posted below that use `aggregate`, `data.table`, or `dplyr`/`tidyverse` functions as those solutions allow you to easily switch out the function of interest (i.e., `sum` instead of `min`). – DanY Jul 23 '18 at 17:29
I believe it's inefficient and convoluted to sort and remove duplicates when it's the most basic case of aggregation. – moodymudskipper Jul 24 '18 at 10:18
1

@Moody_Mudskipper I just posted an alternative different from `aggregate`, because there are other answers using `aggregate`. This is only another point of view. – Jilber Urbina Jul 24 '18 at 14:48
That's fair, and I didn't downvote, but I think it was worth mentioning – moodymudskipper Jul 24 '18 at 14:53
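To illustrate the point made in the comments about swapping the summary function, here is a small sketch using the question's data. The helper `agg_by_a` is hypothetical, introduced only for this example.

```r
a <- c(1,1,1,2,2,2,3,3,6,7,4,5,5,8)
b <- c(5,10,4,2,8,4,6,9,12,3,7,4,1,7)
D <- data.frame(a, b)

# Hypothetical helper: group b by a and apply whichever summary function is passed in
agg_by_a <- function(FUN) aggregate(b ~ a, data = D, FUN = FUN)

mins <- agg_by_a(min)  # minimum of b within each a
sums <- agg_by_a(sum)  # same call, different summary
print(mins)
print(sums)
```

With the sort-and-deduplicate approach, only summaries that pick out an existing row (such as the minimum or maximum) are available, which seems to be the limitation the comment is pointing at.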

score 1 · Answer 3 · answered Jul 23 '18 at 16:57

1

A base R option would be

aggregate(b ~ a, D, min)

answered Jul 23 '18 at 16:57

akrun


Ankur · Answer 4 · 2018-07-28T13:00:45.280

0

library(dplyr)

D <- D %>% group_by(a) %>% summarize(min(b))

edited Jul 28 '18 at 13:00

answered Jul 23 '18 at 16:59

Ankur


Or: library(dplyr); D %>% group_by(a) %>% summarize(min(b)) – Dave2e Jul 23 '18 at 17:01
2

This will still have duplicated rows if there is a tie for the min. The `summarize` method is better to avoid this problem. – Gregor Thomas Jul 23 '18 at 17:03
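The tie issue can be sketched in base R. The comment appears to refer to an earlier revision of the answer that is not shown here, so the filter-style line below is only an assumption about that kind of approach, and the data frame `T2` is made up for illustration.

```r
# Hypothetical data with a tie for the minimum within a == 1
T2 <- data.frame(a = c(1, 1, 2), b = c(4, 4, 9))

# Filter-style approach (assumed for illustration):
# keep every row whose b equals the minimum of its group
kept <- T2[T2$b == ave(T2$b, T2$a, FUN = min), ]

# Summary-style approach: exactly one row per value of a
summ <- aggregate(b ~ a, data = T2, FUN = min)

print(nrow(kept))  # both tied rows for a == 1 are kept
print(nrow(summ))  # one row per group
```

If a filter-style approach is preferred anyway, dropping duplicated values of `a` afterwards would be one way to get back to one row per group.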
