5

I have the following df

dat <- data.frame(Cases = c("Student3","Student3","Student3","Student1","Student1",
"Student2","Student2","Student2","Student4"), Class = rep("Math", 9),
Scores = c(9,5,2,7,3,8,5,1,7), stringsAsFactors = F)


> dat
   Cases    Class   Scores
1 Student3  Math      9
2 Student3  Math      5
3 Student3  Math      2
4 Student1  Math      7
5 Student1  Math      3
6 Student2  Math      8
7 Student2  Math      5
8 Student2  Math      1
9 Student4  Math      7

On the other hand, I have another df with the following information:

d <- data.frame(Cases = c("Student3", "Student1",
"Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)

    Cases  Class
1 Student3  Math
2 Student1  Math
3 Student2  Math
4 Student4  Math

With these two, I want to extract the highest scores for each student. So my output would look like this:

> dat_output
    Cases  Class   Scores
1 Student3  Math      9
2 Student1  Math      7
3 Student2  Math      8
4 Student4  Math      7

I tried with merge but it is not extracting just the highest scores.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Cahidora
  • 143
  • 7
  • I don't think it is a duplicate because I never asked to use a library. Plus, the accepted answer was written on `base R`, not on `dplyr`. – Cahidora Aug 17 '18 at 08:21
  • 2
    possible duplicate then of https://stackoverflow.com/questions/24558328/how-to-select-the-row-with-the-maximum-value-in-each-group – tjebo Aug 17 '18 at 08:23
  • 2
    and https://stackoverflow.com/questions/27534284/filter-dataframe-by-maximum-values-in-each-group/27534390 – tjebo Aug 17 '18 at 08:24
  • 4
    @Tjebo Yes, there are many other posts with "find max per group", but this post has "filter" step: only do this for subset of students that are also in d dataframe. – zx8754 Aug 17 '18 at 08:25
  • 4
    @Tjebo please read the question carefully, it doesn't ask for maximum in group. – Ronak Shah Aug 17 '18 at 08:26
  • 1
    @RonakShah the title speaks a different language to me. – tjebo Aug 17 '18 at 08:27
  • 1
    @zx8754 Even if the list of groups that the maximum is to be found for come from a different dataframe, it's still a "maximum in group" question that has numerous duplicates (8+ just searching with the current title). The "filter" step is not reflected in the question title, and to be honest should be a separate question anyway (which also [has been asked before](https://stackoverflow.com/questions/11612235/select-rows-from-a-data-frame-based-on-values-in-a-vector)). – Mikko Marttila Aug 17 '18 at 10:21
  • @MikkoMarttila do (vote) as you wish, of course. Agree to disagree. – zx8754 Aug 17 '18 at 10:23

7 Answers7

6

We can use sapply on each Cases in d, subset the dat for that Cases and get the max score for it.

sapply(d$Cases, function(x) max(dat$Scores[dat$Cases %in% x]))

#Student3 Student1 Student2 Student4 
#       9        7        8        7 

To get the result as data.frame

transform(d, Scores = sapply(d$Cases, function(x) 
                     max(dat$Scores[dat$Cases %in% x])))

#    Cases Class Scores
# Student3  Math      9 
# Student1  Math      7
# Student2  Math      8
# Student4  Math      7

Note - I have assumed your d to be

d <- data.frame(Cases = c("Student3", "Student1",
      "Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = F)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • also `tapply()` can do it – jogo Aug 17 '18 at 07:29
  • In this case, how can I convert the result to `df`? – Cahidora Aug 17 '18 at 07:43
  • With the updated answer, I just get the `scores` without `student#`... do you get `students` and `scores`? – Cahidora Aug 17 '18 at 07:48
  • @Cahidora Those were row names actually, added an option with `transform` instead, sorry for the confusion. – Ronak Shah Aug 17 '18 at 07:52
  • Simple, fast and no libraries needed... Works great! Thanks. – Cahidora Aug 17 '18 at 07:54
  • One last thing... in case I have an extra column right next to `Class` in `dat` and I want to include it in the output, where can I add it? – Cahidora Aug 17 '18 at 08:16
  • @Cahidora On what basis do you want to select the value in additional column? Would it be corresponding value of `max` ? – Ronak Shah Aug 17 '18 at 08:25
  • It is just a column with extra info corresponding to the `max` value. – Cahidora Aug 17 '18 at 08:26
  • 1
    @Cahidora Sorry, my mind is bit blocked right now, I could not come up with anything better than `cbind(d, t(sapply(d$Cases, function(x) { sub_df = dat[dat$Cases %in% x,]; inds = which.max(sub_df$Scores); c(sub_df$Scores[inds], sub_df$New[inds]) })))` at the moment where `New` is the additional column you want to add. – Ronak Shah Aug 17 '18 at 08:39
  • @RonakShah works great to add the columns... you're a genius! Thank you very much! – Cahidora Aug 17 '18 at 08:46
3

If I am correct you don't need d, since in d there is no additional information that is not in dat already.

You can just do:

dat_output <- aggregate(Scores ~ Cases, dat, max)
dat_output

     Cases Scores
1 Student1      7
2 Student2      8
3 Student3      9
4 Student4      7
Lennyy
  • 5,932
  • 2
  • 10
  • 23
  • 1
    There might be certain entries in `d` which are not present in `dat`. Not necessary the sample represents the actual data. – Ronak Shah Aug 17 '18 at 07:32
3

with dplyr, and considering the case where your d contains a subset of students from your dat

library(dplyr)
inner_join(d, dat %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores)))

# Cases Class Scores
#1 Student3  Math      9
#2 Student1  Math      7
#3 Student2  Math      8
#4 Student4  Math      7

if order does not matter then the following is more efficient:

inner_join(dat, d) %>% group_by(Cases, Class) %>% summarize(Scores=max(Scores))
# A tibble: 4 x 3
# Groups:   Cases [?]
#  Cases    Class Scores
#  <chr>    <chr>  <dbl>
#1 Student1 Math       7
#2 Student2 Math       8
#3 Student3 Math       9
#4 Student4 Math       7
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
  • This is the answer I prefer. I suggest do the join first then pipe the merged data to group_by, etc. – zx8754 Aug 17 '18 at 07:43
  • 1
    Yes but that will change the order of the elements in your final output (order will be different from `d`), although it will be efficient in terms of query optimization etc. – Sandipan Dey Aug 17 '18 at 07:45
  • Good point, I prefer efficiency over how the data is ordered. – zx8754 Aug 17 '18 at 07:47
3

You can also use sqldf package as follow:

sqldf("select max(Scores), Cases from dat JOIN d USING(Cases) group by Cases")

Apply JOIN operation, group by cases and select max(Scores),Cases to get the desired output:

   max(Scores)    Cases
1           7    Student1
2           8    Student2
3           9    Student3
4           7    Student4
Saurabh Chauhan
  • 3,161
  • 2
  • 19
  • 46
3

You can sort your dataframe on Scores in descending order using order. Then remove duplicate Cases. This is a base R solution.

dat <- dat[order(-dat$Scores),]
dat[duplicated(dat$Cases)==F,]

     Cases Class Scores
1 Student3  Math      9
6 Student2  Math      8
4 Student1  Math      7
9 Student4  Math      7

If you first want to make sure that all samples in dat are also in d, you can, in a first step, do this. %in% performs value matching. It, however, does not make a difference based on the example above.

dat <- dat[dat$Cases %in% d$Cases & dat$Class %in% d$Class,]
milan
  • 4,782
  • 2
  • 21
  • 39
1

Using dplyr:

df %>% 
  group_by(Cases, Class) %>% 
  summarise(Scores = max(Scores))

# A tibble: 4 x 3
# Groups:   Cases [?]
  Cases    Class Scores
  <chr>    <chr>  <dbl>
1 Student1 Math      7.
2 Student2 Math      8.
3 Student3 Math      9.
4 Student4 Math      7.

Considering you want to match the two dfs:

df %>%  
  right_join(df2, by = c("Cases", "Class")) %>% 
  group_by(Cases, Class) %>% 
  summarise(Scores = max(Scores))

# A tibble: 4 x 3
# Groups:   Cases [?]
  Cases    Class Scores
  <chr>    <chr>  <dbl>
1 Student1 Math      7.
2 Student2 Math      8.
3 Student3 Math      9.
4 Student4 Math      7.
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
1

Using dplyr, group by students, and get first value based on scores:

library(dplyr)

dat %>% 
  filter(Cases %in% d$Cases) %>% 
  group_by(Cases) %>% 
  top_n(1, Scores) %>%
  ungroup()

# # A tibble: 4 x 3
#   Cases    Class Scores
#   <chr>    <chr>  <dbl>
# 1 Student1 Math       7
# 2 Student2 Math       8
# 3 Student3 Math       9
# 4 Student4 Math       7
zx8754
  • 52,746
  • 12
  • 114
  • 209