uniqid | client_id | hh_id | group_id |
---|---|---|---|
u1 | c1 | h1 | 1 |
u1 | c2 | h1 | 1 |
u1 | c3 | h2 | 1 |
u2 | c4 | h1 | 1 |
u2 | c5 | h2 | 1 |
u3 | c6 | h3 | 2 |
u3 | c7 | h3 | 2 |
u3 | c8 | h4 | 2 |
Let's say a Household can have more than 1 individual as a part of it, with each of them having one master unique-id in the system. But each individual can have more than 1 client-id generated in the system due to the process/workflow. Also there are rare cases where the same client is mapped to more than 1 household.
Expected outcome is to group all the related individuals together into one group, say g1, such that all the individuals who are part of a household (or overlapping with other households), are there in one place.
Dataset:
df <- data.frame(list(uniqid = c("u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"),
client_id = c("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"),
hh_id = c("h1", "h1", "h2", "h1", "h2", "h3", "h3", "h4"),
group_id = c(1,1,1,1,1,2,2,2)))
Group_id is the expected output, one unique id per group of related individuals (or household).
I have tried this approach to group the individuals, solving part of the problem, but this would miss out other household-ids the individual is mapped to.
df %>% group_by(hh_id) %>%
arrange(hh_id, uniqid) %>%
mutate(hh_group = str_c(uniqid, collapse = ""))