Pulling out the top row from a data frame conditional on 2 columns

Question

Here is what my data looks like

Animal  Age  Weight
Cow  5   100
Pig  4   80
Pig  4   85    
Sheep  2   60
Sheep  3   65
Sheep  3   75

The data is sorted by animal, then by age within animal groups, and then by weight (increasing).

Is there a way to subset this data so that I am left with only the lightest animals for each animal group AND age group? This is what the output would become:

Animal  Age  Weight
Cow    5    100
Pig    4    80
Sheep    2    60
Sheep    3    65

edit: I forgot to mention that I want to be able to retain any additional information in each of the rows when assigning it to the new dataset. For example, if there was another column containing information about farm location, I want to keep that. Also, I want to disregard ties, it doesn't matter which of the results it keeps, as long as it keeps one.

`library(data.table); setDT(df)[, min(`Weight(kg)`), .(Animal, Age)]` — pogibas, Nov 06 '17 at 13:55
In base R, `aggregate(Weight.kg. ~ Animal + Age, data=dat, FUN=head, 1)` where the LHS is the name of the weight variable. — lmo, Nov 06 '17 at 14:01
Please check my answer - it will keep any additional information — HAVB, Nov 06 '17 at 15:04
Your answer keeps the additional information. I'm curious if there is a way to stop it from keeping ties. — Dan, Nov 06 '17 at 15:20

HAVB · Accepted Answer · 2017-11-06T15:32:29.900

1

Another dplyr recipe:

library(dplyr)
animals <- data.frame(Animal = c("Cow", "Pig", "Pig", 'Sheep', 'Sheep', 'Sheep'),
                      Age = c(5, 4, 4, 2, 3, 3),
                      Weight = c(100, 80, 85, 60, 65, 75))

animals %>% group_by(Animal, Age) %>% top_n(-1, Weight)

# A tibble: 4 x 3
# Groups:   Animal, Age [4]
  Animal   Age Weight
  <fctr> <dbl>  <dbl>
1    Cow     5    100
2    Pig     4     80
3  Sheep     2     60
4  Sheep     3     65

top_n is a handy function that lets you keep only the N rows with the larger (or lower) value for any given column.

In this example, top_n(-1, Weight) gives you the lowest value (-1, as 1 would be the highest) for variable "Weight"

EDIT:

To disregard "ties", keep only the last row when there's more than one per group:

animals %>% 
    group_by(Animal, Age) %>% 
    top_n(-1, Weight) %>% 
    filter(row_number() == n())

edited Nov 06 '17 at 15:32

answered Nov 06 '17 at 13:57

HAVB

1,858
1
22
37

Thanks! This is doing exactly what I want. Just one thing, whenever there is a tie in the weight, it keeps all values. Is there a way for me to tell R to disregard ties and only keep one? – Dan Nov 06 '17 at 15:04
Not sure if i'm doing something wrong, but adding that last summarise part to the chain stops it from retaining the additional information. It does remove ties though. – Dan Nov 06 '17 at 15:29
@Dan try now, corrected – HAVB Nov 06 '17 at 15:33
Thank you! Working perfectly now. – Dan Nov 06 '17 at 15:35

score 0 · Answer 2 · answered Nov 06 '17 at 13:56

0

You can do this easily with dplyr...

library(dplyr)
df %>% group_by(Animal,Age) %>% 
       summarise(MinWeight=min(Weight))

  Animal   Age MinWeight
1    Cow     5       100
2    Pig     4        80
3  Sheep     2        60
4  Sheep     3        65

(I changed the name of your variable Weight(kg) to just Weight, as the brackets are not suitable in variable names).

answered Nov 06 '17 at 13:56

Andrew Gustar

17,295
1
22
32

I've edited my post to add additional information. I also need it to output any additional information in the rows, even if they have no relevance in the sorting of each individual. – Dan Nov 06 '17 at 14:58

Pulling out the top row from a data frame conditional on 2 columns

2 Answers2