R how to compare two columns for duplicates

Question

I have data with repeated observations that sometimes match on two elements but differ on a third, and sometimes match only on the first. For example:

name <- c("John", "Mary", "Anna", "Anna", "John", "Mary", "Anna", "John")
sport <- c("soccer", "basketball", "tennis", "tennis", "soccer", "soccer", "badminton", "basketball")
time <- c(41, 5, 10, 61, 1, 12, 18, 99)
data <- cbind(name, sport, time)

name    sport       time
John   soccer        41
Mary   basketball    5
Anna   tennis        10 
Anna   tennis        61 
John   soccer        1
Mary   soccer        12
Anna   badminton     18
John   basketball    99

For each observation that matches on the first two columns (e.g. here, on both name and sport), I want to keep only the observation with the greatest time value. For those that match only on the first column (e.g. name), I want to keep them as is. For example:

name    sport       time
John   soccer        41
Mary   basketball    5
Anna   tennis        61 
Mary   soccer        12
Anna   badminton     18
John   basketball    99

How would I do this?

`as.matrix(as.data.frame(data) %>% group_by(name, sport) %>% top_n(1, time))` — M--, Feb 25 '19 at 20:29

score 0 · Accepted Answer · edited Feb 25 '19 at 20:32

One suggestion: build the data with data.frame() instead of cbind():

data <- data.frame(name, sport, time)

Run the following to see the class of each column:

sapply(data, class)

cbind by default coerces everything to class character here, because it builds a matrix and a matrix can hold only one type. You don't want that, since time should stay numeric.
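To check the coercion yourself, here is a minimal sketch using the vectors from the question. The names `m` and `df` are only illustrative and do not appear in the original post.

```r
name  <- c("John", "Mary", "Anna", "Anna", "John", "Mary", "Anna", "John")
sport <- c("soccer", "basketball", "tennis", "tennis", "soccer", "soccer", "badminton", "basketball")
time  <- c(41, 5, 10, 61, 1, 12, 18, 99)

# cbind() on plain vectors returns a matrix; mixing character and numeric
# vectors forces every cell, including time, to character
m <- cbind(name, sport, time)
class(m[, "time"])    # "character"

# data.frame() keeps each column's own type
df <- data.frame(name, sport, time)
class(df$time)        # "numeric"

# Why it matters: max() on strings compares them as text, not as numbers
max(c("10", "5"))     # "5"
```

With time stored as character, "greatest time" would be decided by string comparison, which is why the data frame version is the safer starting point.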

I summarize time by taking its maximum within each name and sport group, and I name the resulting variable time. Also use na.rm = TRUE to ignore any rows where time is missing.

# dplyr version
library(dplyr)
data %>% group_by(name, sport) %>%
    summarize(time = max(time, na.rm = TRUE))

The aggregate() approach suggested in the comments works as well, but I find the dplyr syntax easier to read.
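For comparison, a base R version with aggregate() might look like the sketch below. It assumes the data frame is built with data.frame() as suggested above. The name `result` is only illustrative, and this is one possible form of the call, not necessarily the one from the original comment.

```r
data <- data.frame(
  name  = c("John", "Mary", "Anna", "Anna", "John", "Mary", "Anna", "John"),
  sport = c("soccer", "basketball", "tennis", "tennis", "soccer", "soccer", "badminton", "basketball"),
  time  = c(41, 5, 10, 61, 1, 12, 18, 99)
)

# One row per (name, sport) pair, holding the largest time in that group
result <- aggregate(time ~ name + sport, data = data, FUN = max)
result
```

Note that the formula interface of aggregate() drops rows with missing values by default (na.action = na.omit), so no explicit na.rm argument is needed in this form.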

answered Feb 25 '19 at 20:26 by infominer · edited Feb 25 '19 at 20:32 by M--

*`cbind` by default coerces everything to class of character, you don't want that.* I am not aware of that behaviour. Can you point me to your reference? I do know the following, though: ***The cbind data frame method is just a wrapper for data.frame(..., check.names = FALSE). This means that it will split matrix columns in data frame arguments, and convert character columns to factors unless stringsAsFactors = FALSE is specified.*** – M-- Feb 25 '19 at 20:33
p.s. found it!!! – M-- Feb 25 '19 at 21:21
This is great & very helpful. Thank you!! – quantoid6969 Feb 26 '19 at 20:14
