From the original sample dataset:
id <- c('1','1','2', '2', '3', '3', '3')
month <- c('6', '6', '3', '3', '4', '4', '4')
iso <- c('MEX', 'USA', 'CRI', 'SPA', 'CHN', 'MEX', 'SPA')
value <- c('1550', '1550', '384', '115', '1100', '1100', '1100')
original <- data.frame(id, month, iso, value)
I want to end up with only 1 observation for each id-month pair. The rule to follow is:
- Choose the maximum value for each id-month pair.
- In case there is the same maximum value for the different observations of the same id-month pair, I want to choose one of the rows at random.
Therefore, the new sample dataset will look like this:
id <- c('1', '2', '3')
month <- c('6', '3', '4')
iso <- c('USA', 'CRI','MEX')
value <- c('1550', '384', '1100')
selection_criteria <- c('random','max_value','random')
new <- data.frame(id, month, iso, value, selection_criteria)
I have tried to run the following code:
new <- original %>% group_by(id, month) %>%
filter(value == max(value))
However, it does not make the trick of selecting one variable at random when I have more than one observation (for the same id-month pair) with a maximum value.
My intention is to automatise the process given the large dimension of my dataset.
Any clue?
Thank you.