Is there an R function for dropping duplicates of index variable based on lowest value in another column?

Question

I am trying to analyse large data-sets of student scores. Some students do retakes which produces duplicate scores, usually with the earlier low score placed the row above their retake, usually higher, score. I want to select their highest score, with a file that has only one line per student (which I will need to merge with other files having same ids).
Source file is like this:

STUDID   MATRISUBJ  SUBJSCORE
1032        AfrikaansB  2
1032        isiZuluB    7
1033        IsiXhosaB   6
1034        AfrikaansB  1
1034        EnglishB    4
1034        isiZuluB    3

result should look like this.

STUDID  MATRISUBJ   SUBJSCORE
1032        isiZuluB    7
1033        isiXhosaB   6
1034        EnglishB    4

Help, please..I used to do this process in SPS but now can't get access to this commercialised software, so am swapping to R

df2[!duplicated(df2[1:1]),]

gives the first row of the duplicate but I want the one with highest value, and sometimes student tries with another subject to get required score in languages

Please provide a reproducible example. – Cyrus Mohammadian Dec 28 '18 at 22:27 — Cyrus Mohammadian, Dec 28 '18 at 22:27

score 1 · Answer 1 · answered Dec 29 '18 at 00:33

1

Heyo! The simplest solution would be to use the top_n() function. This will allow you to choose the top n scores based on a numeric column (in your case SUBJSCORE)

The following code will give you what you need :)

  library(tidyverse)

  df %>% 
    group_by(STUDID) %>% 
    top_n(1, SUBJSCORE)

answered Dec 29 '18 at 00:33

thus__

460
3
16

1

Many thanks - that works nicely ! It reduced the file from 3241 rows to 3060, meaning 181 were retakes. It saved the highest score for each retake. – CharlotteM Dec 29 '18 at 10:07

score 0 · Answer 2 · answered Dec 28 '18 at 22:45

0

You could use something like:

 df %>%
 dplyr::group_by(., STUDID) %>%                 
dplyr::arrange(.,desc(score) %>%
.[1,]

answered Dec 28 '18 at 22:45

Curtis

449
1
4
17

score 0 · Answer 3 · answered Dec 28 '18 at 22:46

I typically do something like this using the tidyverse group of packages:

library(tidyverse)

df <- data.frame(id = c('a','a','a','b','b','c','c','c')
             , score = c(90,92,93,75,87,67,68,73)
             , tesno = c(1,2,3,1,2,1,2,3))

df %>% group_by(id) %>% arrange(desc(score)) %>% filter(row_number() == 1) %>%  ungroup()

yearntolearn · Answer 4 · 2018-12-29T00:01:53.593

library(tidyverse)

data <- data.frame(
  STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
  MARISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)


srow <- function(x) {
  r <- which(x$SUBJSCORE == max(x$SUBJSCORE))
  x[r,]
}

dd <- data %>% split(.$STUDID) %>% map(~srow(.)) %>% bind_rows(.$STUDID)
dd

   STUDID  MARISUBJ SUBJSCORE
1   1032  isiZuluB         7
2   1033 IsiXhosaB         6
3   1034  EnglishB         4

score 0 · Answer 5 · answered Dec 28 '18 at 23:29

Here's a short one-line solution, once the data is a data.table:

library(data.table)

data <- data.table(
  STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
  MARISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

data[, .SD[which.max(SUBJSCORE)], by = "STUDID"]

Is there an R function for dropping duplicates of index variable based on lowest value in another column?

5 Answers5

Linked