
I have data with repeated observations that sometimes match on two elements but differ on a third, and sometimes match only on the first. For example:

name <- c("John", "Mary", "Anna", "Anna", "John", "Mary", "Anna", "John")
sport <- c("soccer", "basketball", "tennis", "tennis", "soccer", "soccer", "badminton", "basketball")
time <- c(41, 5, 10, 61, 1, 12, 18, 99)
data <- cbind(name, sport, time)

name    sport       time
John   soccer        41
Mary   basketball    5
Anna   tennis        10 
Anna   tennis        61 
John   soccer        1
Mary   soccer        12
Anna   badminton     18
John   basketball    99

For each observation that matches on the first two columns (e.g. here, on both name and sport), I want to keep only the observation with the greatest time value. For those that match only on the first column (e.g. name), I want to keep them as is. For example:

name    sport       time
John   soccer        41
Mary   basketball    5
Anna   tennis        61 
Mary   soccer        12
Anna   badminton     18
John   basketball    99

How would I do this?

1 Answer


One suggestion: build a data frame instead of using cbind:

data <- data.frame(name, sport, time)

Run the following to check the class of each column:

sapply(data, class)

cbind on plain vectors returns a matrix, and a matrix can hold only one type, so the numeric time values are coerced to character. You don't want that.
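A minimal sketch of the difference, using a shortened version of the vectors from the question (the names m and df are only for illustration):

```r
name <- c("John", "Mary", "Anna")
time <- c(41, 5, 10)

# cbind() on atomic vectors builds a matrix; a matrix has a single type,
# so the numeric values are converted to character strings.
m <- cbind(name, time)
is.matrix(m)          # TRUE
typeof(m)             # "character"

# data.frame() keeps each column's own type, so time stays numeric.
df <- data.frame(name, time)
is.numeric(df$time)   # TRUE
```

With time stored as character, max() would compare strings rather than numbers, which is why the data frame matters for the grouping step.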

The code below groups the data by name and sport, takes the maximum of time within each group, and stores it in a column named time. Setting na.rm = TRUE excludes any missing values of time from the maximum.

# dplyr version
library(dplyr)
data %>% group_by(name, sport) %>%
    summarize(time = max(time, na.rm = TRUE))

aggregate(), suggested in the comments above, works as well, but I find the dplyr syntax easier to read.
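For comparison, here is a sketch of what the base R aggregate() version could look like on the data from the question. The exact call suggested in the comments is not shown here, so this is one plausible form, and the name result is only for illustration:

```r
name <- c("John", "Mary", "Anna", "Anna", "John", "Mary", "Anna", "John")
sport <- c("soccer", "basketball", "tennis", "tennis", "soccer", "soccer", "badminton", "basketball")
time <- c(41, 5, 10, 61, 1, 12, 18, 99)
data <- data.frame(name, sport, time)

# Formula interface: take max(time) for every name/sport combination.
# Rows with a missing time are dropped by the default na.action.
result <- aggregate(time ~ name + sport, data = data, FUN = max)
result
```

This should leave one row per name and sport pair, six rows for this data, although the row order of the result may differ from the order in the original data.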

M--
infominer
  • "`cbind` by default coerces everything to class of character, you don't want that." I am not aware of that behaviour. Can you point me to your reference? I do know the following, though: "The cbind data frame method is just a wrapper for data.frame(..., check.names = FALSE). This means that it will split matrix columns in data frame arguments, and convert character columns to factors unless stringsAsFactors = FALSE is specified." – M-- Feb 25 '19 at 20:33
  • p.s. found it!!! – M-- Feb 25 '19 at 21:21
  • This is great & very helpful. Thank you!! – quantoid6969 Feb 26 '19 at 20:14