
I have a data.frame as shown below.

> df2 <- data.frame("StudentId" = c(1,1,1,2,2,3,3), "Subject" = c("Maths", "Maths", "English","Maths", "English", "Science", "Science"), "Score" = c(100,90,80,70, 60,20,10))
> df2
  StudentId Subject Score
1         1   Maths   100
2         1   Maths    90
3         1 English    80
4         2   Maths    70
5         2 English    60
6         3 Science    20
7         3 Science    10

A few StudentIds have duplicated values in the Subject column (for example, ID 1 has 2 entries for "Maths"). For each StudentId and Subject combination I need to keep only the first of the duplicated rows. The expected data.frame is:

  StudentId Subject Score
1         1   Maths   100
3         1 English    80
4         2   Maths    70
5         2 English    60
6         3 Science    20

I have not been able to do this. Any ideas?

sachinv
    Also [this](http://stackoverflow.com/questions/13967063/remove-duplicate-rows-in-r) and [this](http://stackoverflow.com/questions/13279582/select-only-the-first-rows-for-each-unique-value-of-a-column-in-r) – David Arenburg Feb 08 '16 at 17:19

1 Answer


We can use unique from data.table with the by argument, after converting df2 to a data.table in place with setDT(df2)

library(data.table)
unique(setDT(df2), by = c("StudentId", "Subject"))
#   StudentId Subject Score
#1:         1   Maths   100
#2:         1 English    80
#3:         2   Maths    70
#4:         2 English    60
#5:         3 Science    20

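Or distinct from dplyr (in dplyr 0.5.0 and later, .keep_all = TRUE is needed to retain the Score column)

library(dplyr)
distinct(df2, StudentId, Subject, .keep_all = TRUE)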
#     StudentId Subject Score
#       (dbl)  (fctr) (dbl)
#1         1   Maths   100
#2         1 English    80
#3         2   Maths    70
#4         2 English    60
#5         3 Science    20

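Or duplicated from base R (this assumes df2 is still a plain data.frame; after setDT(df2), df2[1:2] selects rows rather than columns)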

df2[!duplicated(df2[1:2]),]
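
To see why the base R version works, here is a minimal self-contained sketch. It assumes a fresh data.frame that has not been converted with setDT, and it selects the key columns by name rather than by position; the names is_dup and result are only illustrative. duplicated() returns TRUE for every row whose (StudentId, Subject) pair has already appeared in an earlier row, so negating it keeps the first occurrence of each pair.

```r
# Rebuild the example data from the question
df2 <- data.frame(
  StudentId = c(1, 1, 1, 2, 2, 3, 3),
  Subject   = c("Maths", "Maths", "English", "Maths", "English", "Science", "Science"),
  Score     = c(100, 90, 80, 70, 60, 20, 10)
)

# TRUE for each row whose (StudentId, Subject) pair occurred in an earlier row
is_dup <- duplicated(df2[c("StudentId", "Subject")])
is_dup
# [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

# Keep only the first occurrence of each pair; original row names are preserved
result <- df2[!is_dup, ]
result
```

Rows 2 and 7 are dropped, which leaves rows 1, 3, 4, 5 and 6 as in the expected output in the question.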

(EDIT: based on suggestions by @David Arenburg)

akrun