Count how often two factors have the same output value

Question

I want to calculate the number of times two individuals share the same group number. I'm working with quite a large dataset (169 individuals and over 1,000 observations (rows) of them) and I'm looking for an efficient way to count how often each pair occurs in the same group. My (simplified) data looks like this:

ID	Group number	Date	Time
Aa	1	15-06-22	15:05:22
Bd	1	15-06-22	15:05:27
Cr	2	15-06-22	15:07:12
Bd	1	15-06-22	17:33:15
Aa	2	15-06-22	17:36:54
Cr	2	15-06-22	17:37:01
...

I would like my output data to look like this:

Aa-Bd	Aa-Cr	Bd-Cr	...
1	1	0

Or:

Occurrence	Dyad
1	Aa-Bd; Aa-Cr
0	Bd-Cr

Or even a matrix might work. I've been trying to replicate the solution posted for this problem: "Count occurrences of a variable having two given values corresponding to one value of another variable", but for some reason my matrix remains empty, even though I know that certain individuals have been in groups with others.

Any help and suggestions would be extremely appreciated! I feel like the solution shouldn't be too complicated but for some reason I can't seem to figure it out.

Thanks in advance!

Edit: some example data from dput():

dput(c[1:5,])
structure(list(Date = structure(c(19129, 19129, 19129, 19129, 
19129), class = "Date"), Time = c("11:05:58", "11:06:06", "11:06:16", 
"11:06:33", "11:06:59"), Data = structure(c(1L, 1L, 1L, 1L, 1L
), .Label = "Crossing", class = "factor"), Group = structure(c(5L, 
5L, 5L, 5L, 5L), .Label = c("Ankhase", "Baie Dankie", "Kubu", 
"Lemon Tree", "Noha"), class = "factor"), IDIndividual1 =    structure(c(158L, 
158L, 34L, 153L, 14L), .Label = c("Aaa", "Aal", "Aan", "Aapi", 
"Aar", "Aara", "Aare", "Aat", "Amst", "App", "Asis", "Awa", "Beir", 
"Bela", "Bet", "Buk", "Daa", "Dais", "Dazz", "Deli", "Dewe", 
"Dian", "Digb", "Dix", "Dok", "Dore", "Eina", "Eis", "Enge", 
"Fle", "Flu", "Fur", "Gale", "Gaya", "Gese", "Gha", "Ghid", "Gib", 
"Gil", "Ginq", "Gobe", "Godu", "Goe", "Gom", "Gran", "Gree", 
"Gri", "Gris", "Griv", "Guat", "Gub", "Guba", "Gubh", "Guz", 
"Haai", "Hee", "Heer", "Heli", "Hond", "Kom", "Lail", "Lewe", 
"Lif", "Lill", "Lizz", "Mara", "Mas", "Miel", "Misk", "Moes", 
"Mom", "Mui", "Naal", "Nak", "Ncok", "Nda", "Ndaw", "Ndl", "Ndon", 
"Ndum", "Nge", "Nko", "Nkos", "Non", "Nooi", "Numb", "Nurk", 
"Nuu", "Obse", "Oerw", "Oke", "Ome", "Oort", "Ouli", "Oup", "Palm", 
"Pann", "Papp", "Pie", "Piep", "Pix", "Pom", "Popp", "Prai", 
"Prat", "Pret", "Prim", "Puol", "Raba", "Rafa", "Ram", "Rat", 
"Rede", "Ree", "Reen", "Regi", "Ren", "Reno", "Rid", "Rim", "Rioj", 
"Riss", "Riva", "Rivi", "Roc", "Sari", "Sey", "Sho", "Sig", "Sirk", 
"Sitr", "Skem", "Sla", "Spe", "Summary", "Syl", "Tam", "Ted", 
"Tev", "Udup", "Uls", "Umb", "Unk", "UnkAM", "UnkBB", "UnkJ", 
"UnkJF", "UnkJM", "Upps", "Utic", "Utr", "Vla", "Vul", "Xala", 
"Xar", "Xeni", "Xia", "Xian", "Xih", "Xin", "Xinp", "Xop", "Yam", 
"Yamu", "Yara", "Yaz", "Yelo", "Yodo", "Yuko"), class = "factor"), 
Behaviour = structure(c(2L, 3L, 1L, 1L, 1L), .Label = c("Crossing", 
"First Approacher", "First Crosser", "Last Crosser", "Summary"
), class = "factor"), CrossingType = c("Road - Ground Level", 
"Road - Ground Level", "Road - Ground Level", "Road - Ground Level", 
"Road - Ground Level"), GPSS = c(-27.9999, -27.9999, -27.9999, 
-27.9999, -27.9999), GPSE = c(31.20376, 31.20376, 31.20376, 
31.20376, 31.20376), Context = structure(c(1L, 1L, 1L, 1L, 
1L), .Label = c("Crossing", "Feeding", "Moving", "Unknown"
), class = "factor"), Observers = structure(c(12L, 12L, 12L, 
12L, 12L), .Label = c("Christelle", "Christelle; Giulia", 
"Christelle; Maria", "Elif; Giulia", "Josefien; Zach; Flavia; Maria", 
"Mathieu", "Mathieu; Giulia", "Mike; Mila", "Mila", "Mila; Christelle; Giulia", 
"Mila; Elif", "Mila; Giulia", "Nokubonga; Mila", "Nokubonga; Tam; Flavia", 
"Nokubonga; Tam; Flavia; Maria", "Nokubonga; Zach; Flavia; Maria", 
"Tam; Flavia", "Tam; Zach; Flavia; Maria", "Zach", "Zach; Elif; Giulia", 
"Zach; Flavia; Maria", "Zach; Giulia"), class = "factor"), 
DeviceId = structure(c(10L, 10L, 10L, 10L, 10L), .Label = c("{129F4050-2294-0D43-890F-3B2DEF58FC1A}", 
"{1A678F44-DB8C-1245-8DD7-9C2F92F086CA}", "{1B249FD2-AA95-5745-9A32-56CDD0587018}", 
"{2C7026A6-6EDC-BA4F-84EC-3DDADFFD4FD7}", "{2E489E9F-00BE-E342-8CAE-941618B2F0E6}", 
"{359CEB57-351F-F54F-B2BD-77A05FB6C349}", "{3727647C-B73A-184B-B187-D6BF75646B84}", 
"{7A4E6639-7387-7648-88EC-7FD27A0F258A}", "{854B02F2-5979-174A-AAE8-398C21664824}", 
"{89B5C791-1F71-0149-A2F7-F05E0197F501}", "{D92DF19A-9021-A740-AD99-DCCE1D88E064}"
), class = "factor"), Obs.nr = c(1, 1, 1, 1, 1), Gp.nr = c(1, 
3, 3, 4, 5)), row.names = c(NA, -5L), groups = structure(list(
Obs.nr = 1, .rows = structure(list(1:5), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Here, Gp.nr is my group number and IDIndividual1 is my ID.

score 1 · Accepted Answer · answered Jun 15 '22 at 13:51

This is not efficient at all, but as a starting point you can use the following (GN denotes the group number):

my_ID <- unique(df$ID)
# One row and one column per individual
shared <- matrix(nrow = length(my_ID), ncol = length(my_ID))

for (i in seq_along(my_ID)) {
  for (j in seq_along(my_ID)) {
    # Count the distinct group numbers that individuals i and j have in common
    shared[i, j] <- length(intersect(df$GN[df$ID == my_ID[i]], df$GN[df$ID == my_ID[j]]))
  }
}
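
For a larger dataset, the same pairwise count can probably be obtained without loops. The following is a minimal sketch, not part of the original answer, assuming the columns are called ID and GN as above and that a group number identifies a group uniquely. If group numbers restart within each observation (as the Obs.nr and Gp.nr columns in the question suggest they might), a combined key such as interaction(Obs.nr, Gp.nr) would be needed instead; that is an assumption about the data.

```r
# Example data from the comment thread (ID = individual, GN = group number)
df <- data.frame(ID = c("Aa", "Bd", "Cr", "Bd", "Aa", "Cr"),
                 GN = c(1, 1, 2, 1, 2, 2))

# Incidence matrix: rows are groups, columns are individuals;
# 1 if the individual appears in that group at least once, else 0
tab <- unclass(table(df$GN, df$ID))
incidence <- (tab > 0) * 1

# crossprod(x) computes t(x) %*% x, so entry [i, j] is the number of
# groups that contain both individual i and individual j
shared <- crossprod(incidence)
diag(shared) <- 0  # an individual paired with itself is not a dyad

# Long format: one row per dyad, taken from the upper triangle only
idx <- which(upper.tri(shared), arr.ind = TRUE)
dyads <- data.frame(
  Dyad = paste(rownames(shared)[idx[, "row"]],
               colnames(shared)[idx[, "col"]], sep = "-"),
  Occurrence = shared[idx]
)
print(dyads)
```

Like the loop, this counts distinct shared groups rather than shared rows, so an individual recorded twice in the same group is counted once.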

answered Jun 15 '22 at 13:51

king_of_limes


Thank you so much! This seems to be working. Although it gives me a 1 in all fields of the matrix, also for individuals that haven't shared a group together. Could you explain to me why that might be? – Josefien Jun 16 '22 at 14:48
For me it works fine with `df <- data.frame(ID = c("Aa","Bd","Cr","Bd","Aa","Cr"), GN=c(1,1,2,1,2,2))` as input. Also, if you don't need the full matrix (since it is symmetric) you can just start the second for-loop at `i` instead of `1`. Like with the other answer, maybe it is to do with the format of your data? – king_of_limes Jun 16 '22 at 22:21

Deepansh Arora · Answer 2 · 2022-06-17T19:39:02.460

Check this out:

## Creating the Dataframe
df = data.frame(ID = c("Aa","Bd","Cc","Dd","Cr"),
                GroupNumber=c(1,2,1,3,3))

## Loading the libraries
library(dplyr)
library(tidyverse)
library(stringr)

## Grouping to find out which observations share same group
df1 = df %>%
  group_by(GroupNumber) %>%
  summarise(ID_=paste(ID, collapse="-"),
            CountbyID = n_distinct(ID_)) %>%
  filter(str_detect(ID_, "-")) 

## Creating all possible pair combinations and then joining and concatenating all rows
df2 = data.frame(t(combn(df$ID,2))) %>%
  mutate(Comb = paste(X1,"-",X2, sep = "")) %>%
  left_join(df1, by=c("Comb"="ID_")) %>%
  select(Comb, CountbyID) %>%
  replace(is.na(.), 0) %>%
  group_by(CountbyID) %>%
  summarise(ID=paste(Comb, collapse=";"))

Hope this helps!

UPDATE

The way the data frame is set up is causing issues with the "IDIndividual1" column: it has more factor levels than unique data points. I therefore simply converted it to character. Try the code below:

df = df[,c("IDIndividual1","Gp.nr")]
colnames(df) = c("ID","GroupNumber")
df$ID = as.character(df$ID) ## Converting factors to characters
## Loading the libraries
library(dplyr)
library(tidyverse)
library(stringr)

## Grouping to find out which observations share same group
df1 = df %>%
  group_by(GroupNumber) %>%
  summarise(ID_=paste(ID, collapse="-"),
            CountbyID = n_distinct(ID_)) %>%
  filter(str_detect(ID_, "-")) 

## Creating all possible pair combinations and then joining and concatenating all rows
df2 = data.frame(t(combn(df$ID,2))) %>%
  distinct() %>%
  filter(X1 != X2) %>%
  mutate(Comb = paste(X1,"-",X2, sep = "")) %>%
  left_join(df1, by=c("Comb"="ID_")) %>%
  select(Comb, CountbyID) %>%
  replace(is.na(.), 0) %>%
  group_by(CountbyID) %>%
  summarise(ID=paste(Comb, collapse=";"))
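
To illustrate the factor-level issue described in the update, here is a small sketch with made-up values (the vector below is hypothetical, not taken from the asker's data). It shows a factor declaring more levels than it has distinct values, plus two ways of dealing with that.

```r
# Hypothetical miniature of the situation: the factor keeps levels for
# individuals that do not occur in this subset of rows
ids <- factor(c("Yam", "Yam", "Gaya"),
              levels = c("Bela", "Gaya", "Xia", "Yam"))

nlevels(ids)         # number of declared levels
length(unique(ids))  # number of individuals actually present

# The conversion used in the update: plain character strings
ids_chr <- as.character(ids)

# Alternative: keep a factor but drop the unused levels
ids_trim <- droplevels(ids)
nlevels(ids_trim)
```

Either option leaves only the individuals that really occur in the rows being analysed.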

Thank you very much for your code! But I can't seem to get the df2 to work, it gives me this error: "Error: Problem with `mutate()` column `Comb`. i `Comb = paste(X1, "-", X2, sep = "")`. x object 'X1' not found" Sorry it might be an easy solution, but did I do something wrong here? Thanks in advance! — Josefien, Jun 16 '22 at 14:49
The code works perfectly fine for me. Is it possible you could show me a subset of your data? — Deepansh Arora, Jun 16 '22 at 14:57
I've added some example data, hope that works. Thank you so much for taking the time to help me out! — Josefien, Jun 17 '22 at 06:59
@Josefien Please see the update section! Hope this will solve your issue. — Deepansh Arora, Jun 17 '22 at 19:39
Hi Deepash, thank you so much for the update and sorry for my late response, analysing data while in the field is turning out harder than I expected... But this seems to work! Thanks a lot for your effort :) — Josefien, Jun 22 '22 at 11:29
