Identifying Duplicate/Unique Teams (and Restructuring Data) in R

Question

I have a data set that looks like this:

 Person Team
   1     30
   2     30
   3     30
   4     30
   11    40
   22    40
   1     50
   2     50
   3     50
   4     50
   15    60
   16    60
   17    60
   1     70
   2     70
   3     70
   4     70
   11    80
   22    80

My overall goal is to organize that team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:

 Team   Duplicate1  Duplicate2
  30        50          70
  40        80  
  60

As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.

In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.

rsoren · Answer 1 · 2015-01-05T22:44:40.550

This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.

require(dplyr)

df %>%
  arrange(Team, Person) %>%   # this line is necessary in case the rest of your data isn't sorted
  group_by(Team) %>%
  summarize(players = paste0(Person, collapse = ",")) %>%
  group_by(players) %>%
  summarize(teams = paste0(Team, collapse = ",")) %>%
  mutate(
    original_team = ifelse(grepl(",", teams), substr(teams, 1, gregexpr(",", teams)[[1]][1]-1), teams),
    dup_teams = ifelse(grepl(",", teams), substr(teams, gregexpr(",", teams)[[1]][1]+1, nchar(teams)), NA)
  )

The result:

Source: local data frame [3 x 4]

   players    teams original_team dup_teams
1  1,2,3,4 30,50,70            30     50,70
2    11,22    40,80            40        80
3 15,16,17       60            60        NA

Good call on sorting first! I'm editing that in to my answer as well. — Gregor Thomas, Jan 06 '15 at 00:19

score 2 · Answer 2 · answered Jan 05 '15 at 22:04

Using this for your sample data

dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L, 
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L, 
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L, 
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"), 
class = "data.frame", row.names = c(NA, -19L))

You could try a table()/interaction() to find duplicate groups. For example

tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)

this returns

$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"

$`0.0.0.0.0.1.1.1.0`
[1] "60"

$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"

so the group "names" are really just indicators for membership for each person. You could easily rename them if you like with setNames(). But here it collapse the appropriate teams.

Gregor Thomas · Accepted Answer · 2015-01-06T00:20:57.977

Not exactly the format you're wanting, but pretty useful:

# using MrFlick's data
library(dplyr)
dd %>% group_by(Team) %>%
    arrange(Person) %>%
    summarize(team.char = paste(Person, collapse = "_")) %>%
    group_by(team.char) %>%
    arrange(team.char, Team) %>%
    mutate(duplicate = 1:n())

Source: local data frame [6 x 3]
Groups: team.char

  Team team.char duplicate
1   40     11_22         1
2   80     11_22         2
3   60  15_16_17         1
4   30   1_2_3_4         1
5   50   1_2_3_4         2
6   70   1_2_3_4         3

(Edited in the arrange(Person) line in case the data isn't already sorted, got the idea from @Reed's answer.)

talat · Answer 4 · 2015-01-05T22:42:23.710

Two more base R options (though not exactly the desired output):

DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70

$`11, 22`
[1] 40 80

$`15, 16, 17`
[1] 60

Or

( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
  Team     Person DupeGroup
1   30 1, 2, 3, 4         1
2   40     11, 22         2
3   50 1, 2, 3, 4         1
4   60 15, 16, 17         3
5   70 1, 2, 3, 4         1
6   80     11, 22         2

Note that the expected output as shown in the question would either require to add NAs or empty strings in some of the columns entries because in a data.frame, all columns must have the same number of rows. That is different for lists in, as you can see in some of the answers.

The second option, but using data.table, since aggregate tends to be slow for large data:

library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
   Team         V1 DupeGroup
1:   30 1, 2, 3, 4         1
2:   40     11, 22         2
3:   50 1, 2, 3, 4         1
4:   60 15, 16, 17         3
5:   70 1, 2, 3, 4         1
6:   80     11, 22         2

score 2 · Answer 5 · edited May 23 '17 at 10:33

Using uniquecombs from the mgcv package:

library(mgcv)
library(magrittr) # for the pipe %>%

# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))

> printout
$`1`
[1] 60

$`2`
[1] 40 80

$`3`
[1] 30 50 70

Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):

attributes(printout) <- list(names = names(printout)
                             , row.names = 1:max(sapply(printout, length))
                             , class = "data.frame")
> printout
     1    2  3
1   60   40 30
2 <NA>   80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Identifying Duplicate/Unique Teams (and Restructuring Data) in R

5 Answers5