I am trying to write a section of code that does a few things: 1) group dataset by ID 2) count the number of unique months in column data.month 3) remove all IDs that have less than 9 months 4) print distinct IDs based on the company (ie print ID twice if related to 2 companies) 5) remove duplicated ID and keep the record that has the highest data.month number.
I have the code working until 5). I cant get my code to only print the record (row) of duplicate IDs that has the highest month number.
I looked at a few examples here:
R remove duplicates based on other columns
Remove duplicates based on 2nd column condition
I can figure out how to remove duplicates, but I'm having trouble applying it to my circumstances.
This is the two codes I have tried to achieve my goal:
data.check6 <- bind %>%
group_by(bind$ABN) %>%
summarise(count = n_distinct(data.month)) %>%
filter(count>8) %>%
rrange(bind$data.month) %>%
filter(row_number() == 1)
and:
library(tidyverse)
data.check7 <- bind %>%
group_by(ABN)%>%
filter(1 == length(unique(bind$data.month)), !duplicated(bind$data.month))
Right now, I get the error:
Error in arrange_impl(.data, dots) : incorrect size (345343) at position 1, expecting : 3749
In the end I would like to have a dataset where each ID only appears once and it is the ID record associated with the highest month (ie. column value = 12)