Select highest values in a dataframe by group

Question

I have the following df

dat <- data.frame(Cases = c("Student3","Student3","Student3","Student1","Student1",
"Student2","Student2","Student2","Student4"), Class = rep("Math", 9),
Scores = c(9,5,2,7,3,8,5,1,7), stringsAsFactors = F)


> dat
   Cases    Class   Scores
1 Student3  Math      9
2 Student3  Math      5
3 Student3  Math      2
4 Student1  Math      7
5 Student1  Math      3
6 Student2  Math      8
7 Student2  Math      5
8 Student2  Math      1
9 Student4  Math      7

On the other hand, I have another df with the following information:

d <- data.frame(Cases = c("Student3", "Student1",
"Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)

    Cases  Class
1 Student3  Math
2 Student1  Math
3 Student2  Math
4 Student4  Math

With these two, I want to extract the highest scores for each student. So my output would look like this:

> dat_output
    Cases  Class   Scores
1 Student3  Math      9
2 Student1  Math      7
3 Student2  Math      8
4 Student4  Math      7

I tried with merge but it is not extracting just the highest scores.

I don't think it is a duplicate because I never asked to use a library. Plus, the accepted answer was written on `base R`, not on `dplyr`. — Cahidora, Aug 17 '18 at 08:21
possible duplicate then of https://stackoverflow.com/questions/24558328/how-to-select-the-row-with-the-maximum-value-in-each-group — tjebo, Aug 17 '18 at 08:23
and https://stackoverflow.com/questions/27534284/filter-dataframe-by-maximum-values-in-each-group/27534390 — tjebo, Aug 17 '18 at 08:24
@Tjebo Yes, there are many other posts with "find max per group", but this post has "filter" step: only do this for subset of students that are also in d dataframe. — zx8754, Aug 17 '18 at 08:25
@Tjebo please read the question carefully, it doesn't ask for maximum in group. — Ronak Shah, Aug 17 '18 at 08:26
@zx8754 Even if the list of groups that the maximum is to be found for come from a different dataframe, it's still a "maximum in group" question that has numerous duplicates (8+ just searching with the current title). The "filter" step is not reflected in the question title, and to be honest should be a separate question anyway (which also [has been asked before](https://stackoverflow.com/questions/11612235/select-rows-from-a-data-frame-based-on-values-in-a-vector)). — Mikko Marttila, Aug 17 '18 at 10:21
@MikkoMarttila do (vote) as you wish, of course. Agree to disagree. — zx8754, Aug 17 '18 at 10:23

Ronak Shah · Accepted Answer · 2018-08-17T07:51:30.610

6

We can use sapply on each Cases in d, subset the dat for that Cases and get the max score for it.

sapply(d$Cases, function(x) max(dat$Scores[dat$Cases %in% x]))

#Student3 Student1 Student2 Student4 
#       9        7        8        7

To get the result as data.frame

transform(d, Scores = sapply(d$Cases, function(x) 
                     max(dat$Scores[dat$Cases %in% x])))

#    Cases Class Scores
# Student3  Math      9 
# Student1  Math      7
# Student2  Math      8
# Student4  Math      7

Note - I have assumed your d to be

d <- data.frame(Cases = c("Student3", "Student1",
      "Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)

edited Aug 17 '18 at 07:51

answered Aug 17 '18 at 07:27

Ronak Shah

377,200
20
156
213

also `tapply()` can do it – jogo Aug 17 '18 at 07:29
In this case, how can I convert the result to `df`? – Cahidora Aug 17 '18 at 07:43
With the updated answer, I just get the `scores` without `student#`... do you get `students` and `scores`? – Cahidora Aug 17 '18 at 07:48
@Cahidora Those were row names actually, added an option with `transform` instead, sorry for the confusion. – Ronak Shah Aug 17 '18 at 07:52
Simple, fast and no libraries needed... Works great! Thanks. – Cahidora Aug 17 '18 at 07:54
One last thing... in case I have an extra column right next to `Class` in `dat` and I want to include it in the output, where can I add it? – Cahidora Aug 17 '18 at 08:16
@Cahidora On what basis do you want to select the value in additional column? Would it be corresponding value of `max` ? – Ronak Shah Aug 17 '18 at 08:25
It is just a column with extra info corresponding to the `max` value. – Cahidora Aug 17 '18 at 08:26
1

@Cahidora Sorry, my mind is bit blocked right now, I could not come up with anything better than `cbind(d, t(sapply(d$Cases, function(x) { sub_df = dat[dat$Cases %in% x,]; inds = which.max(sub_df$Scores); c(sub_df$Scores[inds], sub_df$New[inds]) })))` at the moment where `New` is the additional column you want to add. – Ronak Shah Aug 17 '18 at 08:39
@RonakShah works great to add the columns... you're a genius! Thank you very much! – Cahidora Aug 17 '18 at 08:46

score 3 · Answer 2 · answered Aug 17 '18 at 07:20

3

If I am correct you don't need d, since in d there is no additional information that is not in dat already.

You can just do:

dat_output <- aggregate(Scores ~ Cases, dat, max)
dat_output

     Cases Scores
1 Student1      7
2 Student2      8
3 Student3      9
4 Student4      7

answered Aug 17 '18 at 07:20

Lennyy

5,932
2
10
23

1

There might be certain entries in `d` which are not present in `dat`. Not necessary the sample represents the actual data. – Ronak Shah Aug 17 '18 at 07:32

Sandipan Dey · Answer 3 · 2018-08-17T07:46:56.857

3

with dplyr, and considering the case where your d contains a subset of students from your dat

library(dplyr)
inner_join(d, dat %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores)))

# Cases Class Scores
#1 Student3  Math      9
#2 Student1  Math      7
#3 Student2  Math      8
#4 Student4  Math      7

if order does not matter then the following is more efficient:

inner_join(dat, d) %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores))
# A tibble: 4 x 3
# Groups:   Cases [?]
#  Cases    Class Scores
#  <chr>    <chr>  <dbl>
#1 Student1 Math       7
#2 Student2 Math       8
#3 Student3 Math       9
#4 Student4 Math       7

edited Aug 17 '18 at 07:46

answered Aug 17 '18 at 07:30

Sandipan Dey

21,482
2
51
63

This is the answer I prefer. I suggest do the join first then pipe the merged data to group_by, etc. – zx8754 Aug 17 '18 at 07:43
1

Yes but that will change the order of the elements in your final output (order will be different from `d`), although it will be efficient in terms of query optimization etc. – Sandipan Dey Aug 17 '18 at 07:45
Good point, I prefer efficiency over how the data is ordered. – zx8754 Aug 17 '18 at 07:47

score 3 · Answer 4 · answered Aug 17 '18 at 07:43

You can also use sqldf package as follow:

sqldf("select max(Scores), Cases from dat JOIN d USING(Cases) group by Cases")

Apply JOIN operation, group by cases and select max(Scores),Cases to get the desired output:

   max(Scores)    Cases
1           7    Student1
2           8    Student2
3           9    Student3
4           7    Student4

milan · Answer 5 · 2018-08-18T08:52:28.140

You can sort your dataframe on Scores in descending order using order. Then remove duplicate Cases. This is a base R solution.

dat <- dat[order(-dat$Scores),]
dat[duplicated(dat$Cases)==F,]

     Cases Class Scores
1 Student3  Math      9
6 Student2  Math      8
4 Student1  Math      7
9 Student4  Math      7

If you first want to make sure that all samples in dat are also in d, you can, in a first step, do this. %in% performs value matching. It, however, does not make a difference based on the example above.

dat <- dat[dat$Cases %in% d$Cases & dat$Class %in% d$Class,]

tmfmnk · Answer 6 · 2019-07-26T07:47:07.730

1

Using dplyr:

df %>% 
  group_by(Cases, Class) %>% 
  summarise(Scores = max(Scores))

# A tibble: 4 x 3
# Groups:   Cases [?]
  Cases    Class Scores
  <chr>    <chr>  <dbl>
1 Student1 Math      7.
2 Student2 Math      8.
3 Student3 Math      9.
4 Student4 Math      7.

Considering you want to match the two dfs:

df %>%  
  right_join(df2, by = c("Cases", "Class")) %>% 
  group_by(Cases, Class) %>% 
  summarise(Scores = max(Scores))

# A tibble: 4 x 3
# Groups:   Cases [?]
  Cases    Class Scores
  <chr>    <chr>  <dbl>
1 Student1 Math      7.
2 Student2 Math      8.
3 Student3 Math      9.
4 Student4 Math      7.

edited Jul 26 '19 at 07:47

answered Aug 17 '18 at 07:29

tmfmnk

38,881
4
47
67

you need to group by with class too, in case there are multiple class values – Sandipan Dey Aug 17 '18 at 07:33
After edits, now this looks like copy of @SandipanDey 's answer. – zx8754 Aug 17 '18 at 07:48
I think it is obvious that we are using (slightly) different approaches. – tmfmnk Aug 17 '18 at 07:56

zx8754 · Answer 7 · 2018-08-17T07:45:49.547

1

Using dplyr, group by students, and get first value based on scores:

library(dplyr)

dat %>% 
  filter(Cases %in% d$Cases) %>% 
  group_by(Cases) %>% 
  top_n(1, Scores) %>%
  ungroup()

# # A tibble: 4 x 3
#   Cases    Class Scores
#   <chr>    <chr>  <dbl>
# 1 Student1 Math       7
# 2 Student2 Math       8
# 3 Student3 Math       9
# 4 Student4 Math       7

edited Aug 17 '18 at 07:45

answered Aug 17 '18 at 07:39

zx8754

52,746
12
114
209

Select highest values in a dataframe by group

7 Answers7