2

I am seeking to determine how many times specific people (ID) collaborate within a predetermined time interval that precedes a given year. Ideally, the solution would be able to provide three things:

  1. The total number of collaborations per individual
  2. The total number of different individuals each person has collaborated with
  3. The total number of repeat collaborations in the existing team

A bunch of questions - 1, 2 ,3 ,4 - relate to this, but non seem to use two distinct criteria. I adapt the data example given here and solved by @trinker.

dat <- read.table(text="Group ID Time
Trx1 A 1980
Trx1 B 1980
Trx1 C 1980
Trx2 E 1980
Trx2 B 1980
Trx3 B 1981
Trx3 C 1981
Trx4 C 1983
Trx4 E 1983
Trx4 B 1983
Trx5 F 1984
Trx5 B 1984
Trx5 C 1984
Trx6 A 1986", header=T) 

So in 1980, two projects occur Trx1 and Trx2. In the first one, A, B, and C collaborate, in the second one, E and B do.

For every IDi I want, the total number of IDj(j <> i) with whom IDi collaborates on a Trx project within z (say 3) years before the focal project. Also, I want the number of different collaborators (e.g. 2 collaborations between A and B in the 3 year interval are only counted as 1). Finally, I need the repeat ties. For parsimony, all zero values can be excluded. My real data has over 40,000 ID, over 90,000 projects, and over 50 years. The outcome should look like this:

ID  year total diff repeat
B   1981    3   3   1
C   1981    2   2   1
B   1983    4   3   3 *
C   1983    3   2   2
E   1983    1   1   1
B   1984    3   2   2
C   1984    3   2   2 **

Explanation:

  • Before 1983, B collaborates twice with C, once with A, and once with E (column total = 4). Only one collaboration with C matters for the fourth column (diff = 3). For the last column, I count the number of repeat collaborations with the current team members. B worked with C at Trx1 and Trx3 and with E at Trx2, thus repeat = 3

** C collaborated between 1981 and 1983 (3 year interval, exclude, 1980) with 3 people in total. 2 collaborations are with B (Trx3, Trx4), 1 is with E (Trx4), thus total = 3, diff = 2. Of the current team members (B and F), C had zero prior collaborations with F and 2 with B in the 3 year window (repeat = 2).

This question is quasi identical to another one I asked here, but I hope it is clearer here.

Community
  • 1
  • 1
SJDS
  • 1,239
  • 1
  • 16
  • 31
  • I don't quiet get the third point. The results includes only 1 ID. Where do you explicitly mention that this number is the collaboration with C and not A for example? –  Dec 08 '15 at 08:14
  • @Deena, I'm not sure what you mean with "the third point". Do you mean how I get the values for `repeat`. `total` counts all collaborations, regardless of how many times the current individual has repeatedly worked with others in the past. `repeat` counts the number of times the current individual has worked with any of the other current team members in the past. Hope that is what you were wondering about. If not, please ask again. – SJDS Dec 08 '15 at 09:13

1 Answers1

2

Here you go:

library(data.table)
library(magrittr)
options(stringsAsFactors = F)

dat <- read.table(text="Group ID Time
Trx1 A 1980
Trx1 B 1980
Trx1 C 1980
Trx2 E 1980
Trx2 B 1980
Trx3 B 1981
Trx3 C 1981
Trx4 C 1983
Trx4 E 1983
Trx4 B 1983
Trx5 F 1984
Trx5 B 1984
Trx5 C 1984
Trx6 A 1986", header=T) 

str(dat)
dat = as.data.table(dat)

priorYears = 3
unqIDs = unique(dat$ID)


results = data.table(ID = character(), year = numeric(), total = numeric(), diff = numeric(), repeatSum = numeric())

for(i in 1:nrow(dat)){

  endYear = dat$Time[i] 
  startYear = endYear - priorYears
  this.ID = dat$ID[i]
  this.group = dat$Group[i]

  #Dates filtering
  subset.DT = dat[dat$Time >= startYear & dat$Time < endYear] 

  # Keep projects where my current ID collaborated 
  groupsToKeep = subset.DT$Group[subset.DT$ID == this.ID] %>% unique
  subset.DT = subset.DT[subset.DT$Group %in% groupsToKeep,]


  # Calculations
  unqMembers = unique(subset.DT$ID) %>% .[. != this.ID]
  currentMembers = dat$ID[dat$Group == this.group] %>% .[. != this.ID]

  total = length(which(subset.DT$ID != this.ID))
  diff = length(unqMembers)
  repeatSum = sum(table(subset.DT$ID)[currentMembers], na.rm = T)

  # Add results
  results = rbind(results, data.frame(ID = this.ID, year = endYear, total, diff, repeatSum))

}`
  • thanks a lot for this solution. It works fine for the first two columns but the repeatSum gave me only zero values. It's not a very R-way of solving this but it does work (took about 2 hours on my computer with 240,000 rows... I'm gonna check if I can find what went wrong with the `repeatSum`. Thanks! – SJDS Dec 09 '15 at 05:27
  • Found the problem. It was in my own code :). I failed to adjust something small. So it seems to work perfectly! Great thanks so much – SJDS Dec 09 '15 at 06:42