1

I have two sets of panel data that I would like to merge. The problem is that, for each respective time interval, the variable which links the two data sets appears more frequently in the first data frame than the second. My objective is to add each row from the second data set to its corresponding row in the first data set, even if that necessitates copying said row multiple times in the same time interval. Specifically, I am working with basketball data from the NBA. The first data set is a panel of Player and Date while the second is one of Team (Tm) and Date. Thus, each Team entry should be copied multiple times per date, once for each player on that team who played that day. I could do this easily in excel, but the data frames are too large.

The result is 0 observations of 52 variables. I've experimented with bind, match, different versions of merge, and I've searched for everything I can think of; but, nothing seems to address this issue specifically. Disclaimer, I am very new to R.

Here is my code up until my road block:

HGwd = "~/Documents/Fantasy/Basketball"
library(plm)
library(mice)
library(VIM)
library(nnet)
library(tseries)
library(foreign)
library(ggplot2)
library(truncreg)
library(boot)
Pdata = read.csv("2015-16PlayerData.csv", header = T)
attach(Pdata)
Pdata$Age = as.numeric(as.character(Pdata$Age))
Pdata$Date = as.Date(Pdata$Date, '%m/%e/%Y')
names(Pdata)[8] = "OppTm"
Pdata$GS = as.factor(as.character(Pdata$GS))
Pdata$MP = as.numeric(as.character(Pdata$MP))
Pdata$FG = as.numeric(as.character(Pdata$FG))
Pdata$FGA = as.numeric(as.character(Pdata$FGA))
Pdata$X2P = as.numeric(as.character(Pdata$X2P))
Pdata$X2PA = as.numeric(as.character(Pdata$X2PA))
Pdata$X3P = as.numeric(as.character(Pdata$X3P))
Pdata$X3PA = as.numeric(as.character(Pdata$X3PA))
Pdata$FT = as.numeric(as.character(Pdata$FT))
Pdata$FTA = as.numeric(as.character(Pdata$FTA))
Pdata$ORB = as.numeric(as.character(Pdata$ORB))
Pdata$DRB = as.numeric(as.character(Pdata$DRB))
Pdata$TRB = as.numeric(as.character(Pdata$TRB))
Pdata$AST = as.numeric(as.character(Pdata$AST))
Pdata$STL = as.numeric(as.character(Pdata$STL))
Pdata$BLK = as.numeric(as.character(Pdata$BLK))
Pdata$TOV = as.numeric(as.character(Pdata$TOV))
Pdata$PF = as.numeric(as.character(Pdata$PF))
Pdata$PTS = as.numeric(as.character(Pdata$PTS))
PdataPD = plm.data(Pdata, index = c("Player", "Date"))
attach(PdataPD)
Tdata = read.csv("2015-16TeamData.csv", header = T)
attach(Tdata)
Tdata$Date = as.Date(Tdata$Date, '%m/%e/%Y')
names(Tdata)[3] = "OppTm"
Tdata$MP = as.numeric(as.character(Tdata$MP))
Tdata$FG = as.numeric(as.character(Tdata$FG))
Tdata$FGA = as.numeric(as.character(Tdata$FGA))
Tdata$X2P = as.numeric(as.character(Tdata$X2P))
Tdata$X2PA = as.numeric(as.character(Tdata$X2PA))
Tdata$X3P = as.numeric(as.character(Tdata$X3P))
Tdata$X3PA = as.numeric(as.character(Tdata$X3PA))
Tdata$FT = as.numeric(as.character(Tdata$FT))
Tdata$FTA = as.numeric(as.character(Tdata$FTA))
Tdata$PTS = as.numeric(as.character(Tdata$PTS))
Tdata$Opp.FG = as.numeric(as.character(Tdata$Opp.FG))
Tdata$Opp.FGA = as.numeric(as.character(Tdata$Opp.FGA))
Tdata$Opp.2P = as.numeric(as.character(Tdata$Opp.2P))
Tdata$Opp.2PA = as.numeric(as.character(Tdata$Opp.2PA))
Tdata$Opp.3P = as.numeric(as.character(Tdata$Opp.3P))
Tdata$Opp.3PA = as.numeric(as.character(Tdata$Opp.3PA))
Tdata$Opp.FT = as.numeric(as.character(Tdata$Opp.FT))
Tdata$Opp.FTA = as.numeric(as.character(Tdata$Opp.FTA))
Tdata$Opp.PTS = as.numeric(as.character(Tdata$Opp.PTS))
TdataPD = plm.data(Tdata, index = c("OppTm", "Date"))
attach(TdataPD)
PD = merge(PdataPD, TdataPD, by = "OppTm", all.x = TRUE)
attach(PD)

Any help on how to do this would be greatly appreciated!

EDIT

I've tweaked it a little from last night, but still nothing seems to do the trick. See the above, updated code for what I am currently using.

Here is the output for head(PdataPD):

    Player       Date   Rk Pos  Tm X..H OppTm W.L GS MP FG FGA   FG. X2P
22408 Aaron Brooks 2015-10-27  817   G CHI        CLE   W  0 16  3   9 0.333   3
22144 Aaron Brooks 2015-10-28  553   G CHI    @   BRK   W  0 16  5   9 0.556   3
21987 Aaron Brooks 2015-10-30  396   G CHI    @   DET   L  0 18  2   6 0.333   1
21456 Aaron Brooks 2015-11-01 4687   G CHI        ORL   W  0 16  3  11 0.273   3
21152 Aaron Brooks 2015-11-03 4383   G CHI    @   CHO   L  0 17  5   8 0.625   1
20805 Aaron Brooks 2015-11-05 4036   G CHI        OKC   W  0 13  4   8 0.500   3
      X2PA  X2P. X3P X3PA  X3P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS GmSc
22408    8 0.375   0    1 0.000  0   0  NA   0   2   2   0   0   0   2  1   6 -0.9
22144    3 1.000   2    6 0.333  0   0  NA   0   1   1   3   1   0   1  4  12  8.5
21987    2 0.500   1    4 0.250  0   0  NA   0   4   4   4   0   0   0  1   5  5.2
21456    6 0.500   0    5 0.000  0   0  NA   2   1   3   1   1   1   1  4   6  1.0
21152    3 0.333   4    5 0.800  0   0  NA   0   0   0   4   1   0   0  4  14 12.6
20805    5 0.600   1    3 0.333  0   0  NA   1   1   2   0   0   0   0  1   9  5.6
       FPTS H.A
22408  7.50   H
22144 20.25   A
21987 16.50   A
21456 14.75   H
21152 24.00   A
20805 12.00   H

And for head(TdataPD):

    OppTm       Date  Rk X Opp    Result  MP FG FGA   FG. X2P X2PA  X2P. X3P X3PA
2105   ATL 2015-10-27  71   DET  L 94-106 240 37  82 0.451  29   55 0.527   8   27
2075   ATL 2015-10-29  41 @ NYK W 112-101 240 42  83 0.506  32   59 0.542  10   24
2047   ATL 2015-10-30  13   CHO   W 97-94 240 36  83 0.434  28   60 0.467   8   23
2025   ATL 2015-11-01 437 @ CHO   W 94-92 240 37  88 0.420  30   59 0.508   7   29
2001   ATL 2015-11-03 413 @ MIA   W 98-92 240 37  90 0.411  30   69 0.435   7   21
1973   ATL 2015-11-04 385   BRK  W 101-87 240 37  76 0.487  29   54 0.537   8   22
      X3P. FT FTA   FT. PTS Opp.FG Opp.FGA Opp.FG. Opp.2P Opp.2PA Opp.2P. Opp.3P
2105 0.296 12  15 0.800  94     37      96   0.385     25      67   0.373     12
2075 0.417 18  26 0.692 112     38      93   0.409     32      64   0.500      6
2047 0.348 17  22 0.773  97     36      88   0.409     24      58   0.414     12
2025 0.241 13  14 0.929  94     32      86   0.372     18      49   0.367     14
2001 0.333 17  22 0.773  98     38      86   0.442     33      58   0.569      5
1973 0.364 19  24 0.792 101     36      83   0.434     31      62   0.500      5
     Opp.3PA Opp.3P. Opp.FT Opp.FTA Opp.FT. Opp.PTS
2105      29   0.414     20      26   0.769     106
2075      29   0.207     19      21   0.905     101
2047      30   0.400     10      13   0.769      94
2025      37   0.378     14      15   0.933      92
2001      28   0.179     11      16   0.688      92
1973      21   0.238     10      13   0.769      87

If there is way to truncate the output from dput(head(___)), I am not familiar with it. It appears that simply erasing the excess characters would remove entire variables from the dataset.

  • 2
    It's easier for others to help if you post your actual data. Rather than all the code you used to create it, change your question to include the output of `dput(head(PdataPD))` and `dput(head(TdataPD))` – SymbolixAU Mar 25 '16 at 04:21
  • 1
    And, you need to tell it what to `merge` on. I think you're after `PD <- merge(PdataPD, tdataPD, by="Date", all.x=TRUE)` – SymbolixAU Mar 25 '16 at 04:26
  • It won't let me show you any of the dput outputs; they're too long.However, PD = merge(PdataPD, TdataPD, by = "Date", all.x = TRUE) came back: Warning messages: 1: In .HTMLsearch(query) : Unrecognized search field: title 2: In .HTMLsearch(query) : Unrecognized search field: keyword 3: In .HTMLsearch(query) : Unrecognized search field: alias It appears to have worked in reverse, by copying each player, multiple times, to each game for a given date, rather than copying each game, multiple times, to each player for a given date. – Brandon McCormick Mar 25 '16 at 05:12
  • swap `all.x=TRUE` for `all.y=TRUE` – SymbolixAU Mar 25 '16 at 05:43
  • All that did was copy each player multiple times to the same date. – Brandon McCormick Mar 25 '16 at 06:43
  • It's hard to help you without seeing your data. You need to make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we know what you're dealing with. Give us an example of your data, not all of it, but enough to reproduce your problem – SymbolixAU Mar 25 '16 at 07:40
  • I understand, please see my edit. – Brandon McCormick Mar 25 '16 at 23:42

1 Answers1

1

It would help if you posted your data (or a working subset of it) and a little more detail on how you are trying to merge, but if I understand what you are trying to do, you want each final data record to have individual stats for each player on a particular date followed by the player's team's stats for that date. In this case, you should have a team column in the Player table that identifies the player's team, and then join the two tables on the composite key Date and Team by setting the by= attribute in merge:

merge(PData, TData, by=c("Date", "Team")) 

The fact that the data frames are of different lengths doesn't matter--this is exactly what join/merge operations are for.

For an alternative to merge(), you might check out the dplyr package join functions at https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

Nat
  • 225
  • 2
  • 9
  • I actually want his opponent's team stats for that date, but, otherwise yes, that is what I am trying to do. You solved it! I had to create the common variable, but including it along with Date in the by() function is what did the trick! Thank you so much! – Brandon McCormick Mar 25 '16 at 23:49
  • If you can't already see where I'm going with this, my next question is how do you lag a variable? Every time I use the lag() function it seems to just copy it. – Brandon McCormick Mar 25 '16 at 23:51
  • You should probably create a new post on SO since this is a separate question, but the lag() function will not overwrite the original variable unless you explicitly reassign it. If you have variable `var=c(1, 2, 3)` and you want `var` to be the lagged version of itself: `var <- lag(var)` – Nat Mar 26 '16 at 02:23