6

I would like to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether the row from table 1 was present in the 2nd table.

First table is a panel data set of some attributes of NBA players during a season:

   firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
   lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
   year<-c("1991","1992","1993","1991","1992","1993","1992","1992")

   season<-data.frame(firstname,lastname,year)


    firstname   lastname        year
  1 Michael      Jordan         1991
  2 Michael      Jordan         1992
  3 Michael      Jordan         1993
  4 Magic        Johnson        1991
  5 Magic        Johnson        1992
  6 Magic        Johnson        1993
  7 Larry        Bird           1992
  8 Larry        Bird           1992

The second data.frame is a panel data set of some attributes of NBA players selected to the All-Star game:

   firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
   lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
   year<-c("1991","1992","1993","1991","1992","1993")

    ALLSTARS<-data.frame(firstname,lastname,year)



     firstname  lastname    year
  1 Michael     Jordan    1991
  2 Michael     Jordan    1992
  3 Michael     Jordan    1993
  4 Magic       Johnson   1991
  5 Magic       Johnson   1992
  6 Magic       Johnson   1993

My desired result looks like:

  firstname lastname    year    allstars

   1    Michael Jordan  1991    1
   2    Michael Jordan  1992    1
   3    Michael Jordan  1993    1
   4    Magic   Johnson 1991    1
   5    Magic   Johnson 1992    1
   6    Magic   Johnson 1993    1
   7    Larry   Bird    1992    0
   8    Larry   Bird    1992    0

I tried to use a left join. But not sure whether that makes sense:

    test<-join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
Sam Firke
  • 21,571
  • 9
  • 87
  • 105
  • I would use `left_join` or `right_join` from the dplyr package as in David's answer. But for fixing your code: it looks like you are using `join()` from the plyr package. You were almost there, just preface your command with `ALLSTARS$allstars <- 1`. Then do your join and convert the `NA` values to 0. – Sam Firke Jul 09 '15 at 12:59
  • possible duplicate of [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – Sam Firke Jul 09 '15 at 13:05
  • @ Sam Firke. That was just extraordinary! :) Thank you. Worked perfectly! –  Jul 09 '15 at 13:07
  • Glad it helped - I've written it up as an answer below. – Sam Firke Jul 09 '15 at 14:57

3 Answers3

4

Here's a simple solution using data.table binary join which allows you to update a column by reference while joing

library(data.table)
setkey(setDT(season), firstname, lastname, year)[ALLSTARS, allstars := 1L]
season
#    firstname lastname year allstars
# 1:     Larry     Bird 1992       NA
# 2:     Larry     Bird 1992       NA
# 3:     Magic  Johnson 1991        1
# 4:     Magic  Johnson 1992        1
# 5:     Magic  Johnson 1993        1
# 6:   Michael   Jordan 1991        1
# 7:   Michael   Jordan 1992        1
# 8:   Michael   Jordan 1993        1

Or using dplyr

library(dplyr)
ALLSTARS %>% 
  mutate(allstars = 1L) %>%
  right_join(., season)
#   firstname lastname year allstars
# 1   Michael   Jordan 1991        1
# 2   Michael   Jordan 1992        1
# 3   Michael   Jordan 1993        1
# 4     Magic  Johnson 1991        1
# 5     Magic  Johnson 1992        1
# 6     Magic  Johnson 1993        1
# 7     Larry     Bird 1992       NA
# 8     Larry     Bird 1992       NA
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • @ David, thank you so much! My example is somehow misleading. Are you referening in your example to the column `allstars` or table `ALLSTARS` –  Jul 09 '15 at 12:32
  • using the first advice I get: `Error in bmerge(i <- shallow(i), x, leftcols, rightcols, io <- haskey(i), : x.'firstname' is a character column being joined to i.'year' which is type 'integer'. Character columns must join to factor or character columns.` –  Jul 09 '15 at 12:41
  • I don't get these errors, but i'd suggest you use `stringsAsFactors = FALSE` when creating your data sets as in `season<-data.frame(firstname,lastname,year, stringsAsFactors = FALSE)` and `ALLSTARS<-data.frame(firstname,lastname,year, stringsAsFactors = FALSE)`. What is the `data.table` version you are using btw? – David Arenburg Jul 09 '15 at 12:44
  • Its in the corresponding data types(`first/lastname=chr`) Hmm ok thanx!! 1.9.4 –  Jul 09 '15 at 12:47
  • I'm using the devel version (1.9.5) so it was probably fixed there. – David Arenburg Jul 09 '15 at 12:50
2

In base R:

ALLSTARS$allstars <- 1L
newdf <- merge(season, ALLSTARS, by=c('firstname', 'lastname', 'year'), all.x=TRUE)
newdf$allstars[is.na(newdf$allstars)] <- 0L 
newdf

Or one I like for a different approach:

season$allstars <- (apply(season, 1, function(x) paste(x, collapse='')) %in%
apply(ALLSTARS, 1, function(x) paste(x, collapse='')))+0L
# 
#   firstname lastname year allstars
# 1   Michael   Jordan 1991        1
# 2   Michael   Jordan 1992        1
# 3   Michael   Jordan 1993        1
# 4     Magic  Johnson 1991        1
# 5     Magic  Johnson 1992        1
# 6     Magic  Johnson 1993        1
# 7     Larry     Bird 1992        0
# 8     Larry     Bird 1992        0
Pierre L
  • 28,203
  • 6
  • 47
  • 69
1

It looks like you are using join() from the plyr package. You were almost there: just preface your command with ALLSTARS$allstars <- 1. Then do your join as it is written and finally convert the NA values to 0. So:

ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
test$allstars[is.na(test$allstars)] <- 0

Result:

  firstname lastname year allstars
1   Michael   Jordan 1991        1
2   Michael   Jordan 1992        1
3   Michael   Jordan 1993        1
4     Magic  Johnson 1991        1
5     Magic  Johnson 1992        1
6     Magic  Johnson 1993        1
7     Larry     Bird 1992        0
8     Larry     Bird 1992        0

Though I personally would use left_join or right_join from the dplyr package, as in David's answer, instead of plyr's join(). Also note that you don't actually need the by argument of join() in this case as by default the function will try to join on all fields with common names, which is what you want here.

Sam Firke
  • 21,571
  • 9
  • 87
  • 105