I have what I think is a simple problem, but my knowledge of R is pretty basic, so I can't find an answer. I have 4 variables. One is a grouping variable I call cluster. The other 3 (ID, IDman, IDwoman) are IDs of individuals. Something like this:

cluster <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")

ID <- c(1, 7, 18, 3, 3, 9, 25, 10, 19)

IDman <- c(1, 2, 3, 3, 3, 4, 10, 10, 6)

IDwoman <- c(5, 7, 9, 11, 12, 14, 19,19,5)

households <- data.frame(cluster, ID, IDman, IDwoman)

The data frame (households) shows the individuals (ID) that live in each household (cluster). Sometimes two of those individuals are married, and this is indicated by a certain combination of IDman and IDwoman: within the same cluster, one ID equals an IDman and another ID equals an IDwoman. For example, the first cluster (cluster=a, the first 3 rows) contains a marriage. IDman=1 and IDwoman=7 are married because they are in the same household (cluster=a): ID and IDman both equal 1 in the first row, and ID and IDwoman both equal 7 in the second row.

So what I need is the number of unique combinations, per cluster, of ID-equals-IDman and ID-equals-IDwoman. For instance, in the second cluster we have none (as there is no IDwoman=9), and in the third cluster we again have one, as IDman=10 and IDwoman=19 both appear in ID; the repeated observation of IDman=10 and IDwoman=19 is counted only once. The outcome doesn't need to be a dataset showing these links, just the number of these unique combinations per cluster.

I don't know how to solve this. I tried the apply and sapply functions, but nothing worked.

Any idea is very welcome.

Thank you!

Uwe
Paco
  • Please add a good [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). That way you can help others to help you! (For example, it's not totally clear what `datatable` in your question should be, and the definition of `cluster` was missing all the quotation... – dario Feb 22 '20 at 07:15
  • Sorry about it. Hope now it's enough. – Paco Feb 22 '20 at 16:50
  • I'm still not sure I understand what you want to do. It would really help if you showed what the resulting data.frame should look like. If we use the example dataset `households` and do the *thing*, what will the result look like? – dario Feb 22 '20 at 17:21
  • Thank you for your time and question. As I said in the question, what I'm looking for is the number of unique combinations of ID-equals-IDman and ID-equals-IDwoman per cluster. I need to calculate this number. I'm not looking for an outcome dataset. – Paco Feb 22 '20 at 17:41
  • Does the suggestion from @Parfait solve your problem? – dario Feb 22 '20 at 18:07
  • I think it does. Yes. Thank you. – Paco Feb 22 '20 at 22:46
  • Just to understand the logic of your requirement: You are referring to the condition *ID-equals-IDwoman*, but for the third cluster there is *no row* which fulfills the condition *ID-equals-IDwoman*. Or do you just want to check if `IDwoman` is contained in any of the `ID`s which belong to `cluster`? – Uwe Feb 23 '20 at 15:44

1 Answer

Consider assigning a marriage column with `ave` (in-line aggregation by group), where `max` returns 1 if any value in the group is TRUE.

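households <- within(households, {
    # 1 if any IDman / IDwoman in the cluster also appears in ID, else 0
    man <- ave(IDman %in% ID, cluster, FUN=max)
    woman <- ave(IDwoman %in% ID, cluster, FUN=max)
    # a cluster has a marriage when both flags are set
    marriage <- man == 1 & woman == 1

    # drop the helper columns so that only marriage is added
    rm(man, woman)
})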

households
#   cluster ID IDman IDwoman marriage
# 1       a  1     1       5     TRUE
# 2       a  7     2       7     TRUE
# 3       a 18     3       9     TRUE
# 4       b  3     3      11    FALSE
# 5       b  3     3      12    FALSE
# 6       b  9     4      14    FALSE
# 7       c 25    10      19     TRUE
# 8       c 10    10      19     TRUE
# 9       c 19     6       5     TRUE
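
As a small illustration of what `ave` with `FUN=max` does here (the vectors `x` and `g` and the name `flag` are made up for this sketch, they are not part of the question's data): every element is replaced by the maximum of its group, so a group containing at least one TRUE comes back as 1 throughout.

```r
# Toy data (hypothetical): a logical flag and a grouping vector.
x <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
g <- c("a", "a", "b", "b", "c")

# max() of a logical vector is 1 if any element is TRUE, else 0;
# ave() writes that group-level value back to every row of the group.
flag <- ave(x, g, FUN = max)
flag
# [1] 1 1 0 0 1
```

This is why the code above compares `man` and `woman` against 1 rather than against TRUE.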

And for unique combinations, filter the data frame by rows and columns, then run unique:

unique(households[households$marriage == TRUE,
                  c("cluster", "marriage")])

#   cluster marriage
# 1       a     TRUE
# 7       c     TRUE
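
One caveat: `IDman %in% ID` compares against the IDs of the whole data frame, not only those of the same cluster. On the example data this makes no difference, but if the match has to happen inside the household, a sketch along these lines might work. It assumes the reading that a cluster has a marriage when some IDman and some IDwoman both occur among that cluster's own IDs; `has_marriage` and `marriage_by_cluster` are names made up for this sketch.

```r
# Rebuild the example data from the question.
households <- data.frame(
  cluster = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  ID      = c(1, 7, 18, 3, 3, 9, 25, 10, 19),
  IDman   = c(1, 2, 3, 3, 3, 4, 10, 10, 6),
  IDwoman = c(5, 7, 9, 11, 12, 14, 19, 19, 5)
)

# Hypothetical helper: TRUE if, inside one cluster, some IDman and some
# IDwoman both occur among that cluster's own IDs (an assumed reading
# of the requirement, not confirmed by the asker).
has_marriage <- function(d) {
  any(d$IDman %in% d$ID) && any(d$IDwoman %in% d$ID)
}

# Split the rows by cluster and apply the helper to each piece.
marriage_by_cluster <- vapply(
  split(households, households$cluster),
  has_marriage,
  logical(1)
)
marriage_by_cluster
#     a     b     c
#  TRUE FALSE  TRUE
```

`sum(marriage_by_cluster)` then gives the number of clusters flagged this way.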
Parfait