6

My current data looks like this:

Person  Team
  10    100
  11    100
  12    100
  10    200
  11    200
  14    200
  15    200

I want to infer who knew one another, based on what teams they were on together. I also want a count of how many times a dyad was on a team together, and I want to keep track of the team identification codes that link each pair of people. In other words, I want to create a data set that looks like this:

Person1 Person2 Count   Team1   Team2   Team3
   10      11     2      100     200     NA
   10      12     1      100     NA      NA
   11      12     1      100     NA      NA
   10      14     1      200     NA      NA
   10      15     1      200     NA      NA
   11      14     1      200     NA      NA
   11      15     1      200     NA      NA

The resulting data set captures the relationships that can be inferred based on the teams that were outlined in the original data set. The "Count" variable reflects the number of instances that a pair of people was on a team together. The "Team1", "Team2", and "Team3" variables list the team ID(s) that link each pair of people to one another. It doesn't make a difference which person/team ID is listed first versus second. Teams range in size from 2 members to 8 members.
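
For reference, the data above can be reproduced as a data frame like this (integer column types assumed):

dd <- data.frame(
  Person = c(10L, 11L, 12L, 10L, 11L, 14L, 15L),
  Team   = c(100L, 100L, 100L, 200L, 200L, 200L, 200L)
)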

waxattax
    It's really very annoying when people say "my data is like ...." and then post an example that does not allow comparison of analysis to the desired result. Learn to post the output of some portion of the data and the exact output expected. Could be the output of `dput(head(dat, 10))` – IRTFM Jan 06 '15 at 23:47
    @BondedDust, the desired output perfectly corresponds to the provided data. – David Arenburg Jan 07 '15 at 00:30
    Almost perfectly... it's missing one row. Persons 14 and 15 were both on Team 200. – Gregor Thomas Jan 07 '15 at 00:50

4 Answers

6

Here's a "data.table" solution that seems to get to where you want to get (albeit with quite a mouthful of code):

library(data.table)
dcast.data.table(
  dcast.data.table(
    as.data.table(d)[, combn(Person, 2), by = Team][
      , ind := paste0("Person", c(1, 2))][
        , time := sequence(.N), by = list(Team, ind)], 
    time + Team ~ ind, value.var = "V1")[
      , c("count", "time") := list(.N, sequence(.N)), by = list(Person1, Person2)],
  Person1 + Person2 + count ~ time, value.var = "Team")
#    Person1 Person2 count   1   2
# 1:      10      11     2 100 200
# 2:      10      12     1 100  NA
# 3:      10      14     1 200  NA
# 4:      10      15     1 200  NA
# 5:      11      12     1 100  NA
# 6:      11      14     1 200  NA
# 7:      11      15     1 200  NA
# 8:      14      15     1 200  NA

Update: Step-by-step version of the above

To understand what's happening above, here's a step-by-step approach:

## The following would be a long data.table with 4 columns:
##   Team, V1, ind, and time
step1 <- as.data.table(d)[
  , combn(Person, 2), by = Team][
    , ind := paste0("Person", c(1, 2))][
      , time := sequence(.N), by = list(Team, ind)]
head(step1)
#    Team V1     ind time
# 1:  100 10 Person1    1
# 2:  100 11 Person2    1
# 3:  100 10 Person1    2
# 4:  100 12 Person2    2
# 5:  100 11 Person1    3
# 6:  100 12 Person2    3

## Here, we make the data "wide"
step2 <- dcast.data.table(step1, time + Team ~ ind, value.var = "V1")
step2
#    time Team Person1 Person2
# 1:    1  100      10      11
# 2:    1  200      10      11
# 3:    2  100      10      12
# 4:    2  200      10      14
# 5:    3  100      11      12
# 6:    3  200      10      15
# 7:    4  200      11      14
# 8:    5  200      11      15
# 9:    6  200      14      15

## Create a "count" column and a "time" column,
##   grouped by "Person1" and "Person2".
##   Count is for the count column.
##   Time is for going to a wide format
step3 <- step2[, c("count", "time") := list(.N, sequence(.N)), 
               by = list(Person1, Person2)]
step3
#    time Team Person1 Person2 count
# 1:    1  100      10      11     2
# 2:    2  200      10      11     2
# 3:    1  100      10      12     1
# 4:    1  200      10      14     1
# 5:    1  100      11      12     1
# 6:    1  200      10      15     1
# 7:    1  200      11      14     1
# 8:    1  200      11      15     1
# 9:    1  200      14      15     1

## The final step of going wide
out <- dcast.data.table(step3, Person1 + Person2 + count ~ time, 
                        value.var = "Team")
out
#    Person1 Person2 count   1   2
# 1:      10      11     2 100 200
# 2:      10      12     1 100  NA
# 3:      10      14     1 200  NA
# 4:      10      15     1 200  NA
# 5:      11      12     1 100  NA
# 6:      11      14     1 200  NA
# 7:      11      15     1 200  NA
# 8:      14      15     1 200  NA
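
If you prefer the "Team1"/"Team2" names from your desired output over "1" and "2", a quick follow-up on the "out" object from the last step (just a small sketch) would be:

setnames(out, c("1", "2"), paste0("Team", 1:2))
setnames(out, "count", "Count")
out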
A5C1D2H2I1M1N2O1R2T1
4

The counts are easy to obtain with a self-join, which I think is easiest to do using sqldf. (Note that I probably think sqldf is easiest because I'm not too good with data.table.) Editing to include @G. Grothendieck's suggestion:

# your data
dd <- structure(list(Person = c(10L, 11L, 12L, 10L, 11L, 14L, 15L), 
    Team = c(100L, 100L, 100L, 200L, 200L, 200L, 200L)), .Names = c("Person", 
"Team"), class = "data.frame", row.names = c(NA, -7L))

library(sqldf)
dyads = sqldf("select dd1.Person Person1, dd2.Person Person2
               , count(*) Count
               , group_concat(dd1.Team) Teams 
  from dd dd1
  inner join dd dd2
  on dd1.Team = dd2.Team and dd1.Person < dd2.Person
  group by dd1.Person, dd2.Person")

  Person1 Person2 Count   Teams
1      10      11     2 100,200
2      10      12     1     100
3      10      14     1     200
4      10      15     1     200
5      11      12     1     100
6      11      14     1     200
7      11      15     1     200
8      14      15     1     200

We can then split the string to get the columns you want.

library(stringr)
cbind(dyads, apply(str_split_fixed(dyads$Teams, ",",
                                   n = max(str_count(dyads$Teams, pattern = ",")) + 1),
                   MARGIN = 2, FUN = as.numeric))
  Person1 Person2 Count   Teams   1   2
1      10      11     2 100,200 100 200
2      10      12     1     100 100  NA
3      10      14     1     200 200  NA
4      10      15     1     200 200  NA
5      11      12     1     100 100  NA
6      11      14     1     200 200  NA
7      11      15     1     200 200  NA
8      14      15     1     200 200  NA

I'll leave the column renaming to you.
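
For example, something along these lines would do it (a quick sketch; it assumes the result of the cbind() above was saved as "res", and it uses the Team1, Team2, ... names from your desired output):

team.cols <- grepl("^[0-9]+$", names(res))
names(res)[team.cols] <- paste0("Team", seq_len(sum(team.cols)))
res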

Gregor Thomas
    This modification would add a `Teams` column with a comma separated list of teams: `sqldf("select dd1.Person Person1, dd2.Person Person2, count(*) Count, group_concat(dd1.Team) Teams from dd dd1 inner join dd dd2 on dd1.Team = dd2.Team and dd1.Person < dd2.Person group by dd1.Person, dd2.Person")` – G. Grothendieck Jan 07 '15 at 13:02
4

Following @Gregor and using Gregor's data, I tried to add team columns. I could not produce exactly what you requested, but this may still be useful. Using full_join from the dev version of dplyr (dplyr 0.4), I did the following. First, I created a data frame with all combinations of Person within each team using combn and saved it as the object a. Then, I split a by team and used full_join to create the team columns, at least for teams 100 and 200. Finally, I used rename to change the column names and select to put the columns in your order.

library(dplyr)

group_by(dd, Team) %>%
    do(data.frame(t(combn(.$Person, 2)))) %>%
    data.frame() -> a

full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
    rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y) %>%
    select(Person1, Person2, Team1, Team2)

#  Person1 Person2 Team1 Team2
#1      10      11   100   200
#2      10      12   100    NA
#3      11      12   100    NA
#4      10      14    NA   200
#5      10      15    NA   200
#6      11      14    NA   200
#7      11      15    NA   200
#8      14      15    NA   200

EDIT

I am sure there are better ways of doing this, but this is the closest I could get. In this version, I added the count using another join.

group_by(dd, Team) %>%
    do(data.frame(t(combn(.$Person, 2)))) %>%
    data.frame() -> a

full_join(filter(a, Team == "100"), filter(a, Team == "200"), by = c("X1", "X2")) %>%
    full_join(count(a, X1, X2), by = c("X1", "X2")) %>%
    rename(Person1 = X1, Person2 = X2, Team1 = Team.x, Team2 = Team.y, Count = n) %>%
    select(Person1, Person2, Count, Team1, Team2)

#  Person1 Person2 Count Team1 Team2
#1      10      11     2   100   200
#2      10      12     1   100    NA
#3      11      12     1   100    NA
#4      10      14     1    NA   200
#5      10      15     1    NA   200
#6      11      14     1    NA   200
#7      11      15     1    NA   200
#8      14      15     1    NA   200
jazzurro
1

Here is a general solution (the n argument lets it handle groups larger than pairs):

library(dplyr)
library(reshape2)

find.friends <- function(d,n=2) {
    d$exist <- T

    z <- dcast(d,Person~Team,value.var='exist')
    #       Person  100  200
    #     1     10 TRUE TRUE
    #     2     11 TRUE TRUE
    #     3     12 TRUE   NA
    #     4     14   NA TRUE
    #     5     15   NA TRUE


    # lapply (rather than sapply) guarantees a list, even if every team
    # happens to yield a matrix of the same size
    pairs.per.team <- lapply(
                        sort(unique(d$Team)),
                        function(team) {
                            non.na <- !is.na(z[,team])
                            if (sum(non.na)<n) return(NULL)
                            combns <- t(combn(z$Person[non.na],n))
                            cbind(combns,team)
                        }
    )
    df <- as.data.frame(do.call(rbind,pairs.per.team))    
    if (nrow(df)==0) return()

    persons <- sprintf('Person%i',1:n)
    colnames(df)[1:n] <- persons
    #       Person1 Person2 team
    #     1      10      11  100
    #     2      10      12  100
    #     3      11      12  100
    #     4      10      11  200
    #     5      10      14  200
    #     6      10      15  200
    #     7      11      14  200
    #     8      11      15  200
    #     9      14      15  200
    # Personally, I find the data frame above most suitable for further analysis.
    # The following code is needed only to make the output compatible with the author's one


    df2 <- df %>% 
            grouped_df(as.list(persons)) %>% 
            mutate(i.team=paste0('team',seq_along(team))) 
    #       Person1 Person2 team i.team
    #     1      10      11  100  team1
    #     2      10      12  100  team1
    #     3      11      12  100  team1
    #     4      10      11  200  team2
    #     5      10      14  200  team1
    #     6      10      15  200  team1
    #     7      11      14  200  team1
    #     8      11      15  200  team1
    #     9      14      15  200  team1

    # count number of teams per pair
    df2.count <- df %>% 
                    grouped_df(as.list(persons)) %>% 
                    summarize(cnt=length(team))

    # reshape the data
    df3 <- dcast(df2,
         as.formula(sprintf('%s~i.team',paste(persons,collapse='+'))),
         value.var='team'
         )

    df3$count <- df2.count$cnt
    df3
}

Your data are:

d <- structure(list(Person = c("10", "11", "12", "10", "11", "14", 
                          "15"), Team = c("100", "100", "100", "200", "200", "200", "200"
                          )), .Names = c("Person", "Team"), row.names = c(NA, -7L), class = "data.frame")

By using

 find.friends(d,n=2)

you should obtain the desired output.

By changing n, you can also find triads, tetrads etc.
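
For instance, a call like the following (a sketch; output not shown) would list triads instead of dyads:

 find.friends(d, n = 3)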

Marat Talipov
    I presume that since you're using `%>%`, you meant `library(dplyr)` not `library(plyr)`? Also, any reason to use "reshape" and not "reshape2" or "tidyr" (which should be more efficient)? – A5C1D2H2I1M1N2O1R2T1 Jan 07 '15 at 06:37