Subset removing duplicated field combinations keeping the one with highest score

Question

I'm learning R, any help will be welcomed.

I need to clean my data, removing the duplicated combinations of project-raingauge. My data has a larger number of variables but I present a simplified version:

ID_project <- c(1,1,1,1,2,2,2,2)
ID_raingauge <- c("A","B","B","B","A","A","B","C")
COMB_check <- c("|","|","ok","ok","|","ok","|","|")
score<- c(0.7,0.5,1.2,0.3,0.4,0.1,0.6,1.4)
mydata <- data.frame(ID_project,ID_raingauge,COMB_check,score)

ID_project ID_rainguage COMB_check score
======
1         A           |        0.7
1         B           |        0.5
1         B           ok       1.2
1         B           ok       0.3
2         A           |        0.4
2         A           ok       0.1
2         B           |        0.6
2         C           |        1.4

For each combination of ID_project and ID_raingauge in some cases we have more than one score. You can notice the repeated combination in COMB_check. The first time one combination appears COMB_check= "|", while next rows with the same combination COMB_check="ok".

I want to obtain the same data but just one example for each combination (ID_project-ID_raingauge). I need to keep the one with the highest score. The example result would be:

ID_project ID_rainguage COMB_check score
======
1         A           |        0.7
1         B           ok       1.2
2         A           |        0.4
2         B           |        0.6
2         C           |        1.4

Thank you in advance

Fatih Ekici · Accepted Answer · 2021-03-15T13:33:40.603

0

   library(dplyr)   
mydata %>% group_by(ID_project,ID_raingauge) %>% 
      summarise(max(score))

Result

     ID_project ID_raingauge `max(score)`
       <dbl> <chr>               <dbl>
1          1 A                     0.7
2          1 B                     1.2
3          2 A                     0.4
4          2 B                     0.6
5          2 C                     1.4

if not clear, i would explain what ı do

shortly you want to find max score by 2 specific columns, so I grouped that columns and created table which just max value of score table.

I hope I got it right.

edited Mar 15 '21 at 13:33

answered Mar 15 '21 at 13:26

Fatih Ekici

123
8

Thank you Fatih for your quick answer. This works perfectly for the simplified version of my data. However, I just get the data of the 3 fields (ID_project, ID_raingauge and max(score). How can I write it to keep the rest of fields? – Ferran Mar 15 '21 at 14:02
just change summarise to filter. mydata %>% group_by(ID_project,ID_raingauge) %>% filter(score == max(score)) – Fatih Ekici Mar 15 '21 at 14:12
That works perfectly, thank you so much! – Ferran Mar 15 '21 at 14:24

Subset removing duplicated field combinations keeping the one with highest score

1 Answers1