This is my first post here. I have a large dataset and I am trying to remove duplicate rows based on the value of one variable (ERRaw). When I use the following code, the resulting dataset excludes some cases that had no duplicates in the original, and I don't understand why. I need to keep all singleton cases and remove only the duplicates. Please help!

new_data <- data_with_dups %>% 
  group_by(StudentID, District) %>% 
  distinct(StudentID, ERRaw, .keep_all = T) %>%
  top_n(1, ERRaw) 

Thank you!

  • To remove all groups with more than one obs, you can do `group_by(stuff) %>% filter(n() == 1)` ..? If that's not it, maybe you could make an illustrative example. – Frank Jul 30 '18 at 17:08

1 Answer

I think any of these should work. If you provide copy/pasteable sample data, I'll test and make sure.

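# group_by, arrange, and slice
library(dplyr)
new_data <- data_with_dups %>% 
  group_by(StudentID, District) %>% 
  arrange(desc(ERRaw)) %>%   # highest ERRaw first; NA values sort last
  slice(1) %>%               # keep the first row of each StudentID/District group
  ungroup()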

# base R sort, !duplicated
new_data = data_with_dups[order(data_with_dups$ERRaw, decreasing = TRUE), ]
new_data = new_data[!duplicated(new_data[c("StudentID", "District")]), ]
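
Since no sample data was posted, here is a minimal sketch on made-up data (the values below are hypothetical, not from the question). It assumes the goal is one row per StudentID/District with the highest non-NA ERRaw, and it shows one possible reason singletons can disappear: `top_n()` ranks an NA as NA, and the underlying filter then drops that row.

```r
library(dplyr)

# Hypothetical sample data (an assumption for illustration only):
# student 1 is duplicated (one NA, one score), student 2 is duplicated
# (two scores), student 3 is a singleton whose ERRaw is NA.
data_with_dups <- data.frame(
  StudentID = c(1, 1, 2, 2, 3),
  District  = c("A", "A", "A", "A", "B"),
  ERRaw     = c(NA, 10, 7, 9, NA)
)

# arrange() sorts NA last even with desc(), so slice(1) picks the highest
# non-NA ERRaw in each group and still keeps groups where ERRaw is all NA.
new_data <- data_with_dups %>%
  group_by(StudentID, District) %>%
  arrange(desc(ERRaw)) %>%
  slice(1) %>%
  ungroup()

# For comparison: top_n() filters on a rank, and the rank of NA is NA,
# so the singleton with a missing ERRaw (student 3) is dropped here.
with_top_n <- data_with_dups %>%
  group_by(StudentID, District) %>%
  top_n(1, ERRaw) %>%
  ungroup()
```

Comparing `nrow(new_data)` with `nrow(with_top_n)` on your real data should show whether missing ERRaw values explain the lost singletons. If they do not, the cause is something else and a reproducible example would help.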
Gregor Thomas
  • Thank you. These solutions do not keep the row with the largest (or non-NA) ERRaw value, and there are still duplicate values of StudentID. I am trying to (1) keep all non-duplicated records, and (2) out of the ones with duplicates, select the row with the highest (or non-NA) value of ERRaw. Does this help? Thank you!! – user3703717 Jul 30 '18 at 17:53
  • Yes, that is useful information - you should put it into your question. I thought you wanted to remove duplicate ERRaw values, not remove duplicate Student/District IDs keeping the highest ERRaw value. I'll edit the answer to reflect my new understanding. – Gregor Thomas Jul 30 '18 at 19:03