1

I am trying to analyse large data-sets of student scores. Some students do retakes which produces duplicate scores, usually with the earlier low score placed the row above their retake, usually higher, score. I want to select their highest score, with a file that has only one line per student (which I will need to merge with other files having same ids).
Source file is like this:

STUDID   MATRISUBJ  SUBJSCORE
1032        AfrikaansB  2
1032        isiZuluB    7
1033        IsiXhosaB   6
1034        AfrikaansB  1
1034        EnglishB    4
1034        isiZuluB    3

result should look like this.

STUDID  MATRISUBJ   SUBJSCORE
1032        isiZuluB    7
1033        isiXhosaB   6
1034        EnglishB    4

Help, please..I used to do this process in SPS but now can't get access to this commercialised software, so am swapping to R

df2[!duplicated(df2[1:1]),]

gives the first row of the duplicate but I want the one with highest value, and sometimes student tries with another subject to get required score in languages

pogibas
  • 27,303
  • 19
  • 84
  • 117
CharlotteM
  • 47
  • 4

5 Answers5

1

Heyo! The simplest solution would be to use the top_n() function. This will allow you to choose the top n scores based on a numeric column (in your case SUBJSCORE)

The following code will give you what you need :)

  library(tidyverse)

  df %>% 
    group_by(STUDID) %>% 
    top_n(1, SUBJSCORE)
thus__
  • 460
  • 3
  • 16
  • 1
    Many thanks - that works nicely ! It reduced the file from 3241 rows to 3060, meaning 181 were retakes. It saved the highest score for each retake. – CharlotteM Dec 29 '18 at 10:07
0

You could use something like:

 df %>%
 dplyr::group_by(., STUDID) %>%                 
dplyr::arrange(.,desc(score) %>%
.[1,]
Curtis
  • 449
  • 1
  • 4
  • 17
0

I typically do something like this using the tidyverse group of packages:

library(tidyverse)

df <- data.frame(id = c('a','a','a','b','b','c','c','c')
             , score = c(90,92,93,75,87,67,68,73)
             , tesno = c(1,2,3,1,2,1,2,3))

df %>% group_by(id) %>% arrange(desc(score)) %>% filter(row_number() == 1) %>%  ungroup()
0
library(tidyverse)

data <- data.frame(
  STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
  MARISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)


srow <- function(x) {
  r <- which(x$SUBJSCORE == max(x$SUBJSCORE))
  x[r,]
}

dd <- data %>% split(.$STUDID) %>% map(~srow(.)) %>% bind_rows(.$STUDID)
dd

   STUDID  MARISUBJ SUBJSCORE
1   1032  isiZuluB         7
2   1033 IsiXhosaB         6
3   1034  EnglishB         4
yearntolearn
  • 1,064
  • 2
  • 17
  • 36
0

Here's a short one-line solution, once the data is a data.table:

library(data.table)

data <- data.table(
  STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
  MARISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

data[, .SD[which.max(SUBJSCORE)], by = "STUDID"]
mpjdem
  • 1,504
  • 9
  • 14