I have a data frame that merges player and team data for soccer seasons. For a particular player in a specific season, I have data like:
df <- data.frame(team=c(NA,"CRP",NA,"CRP","CRP",NA),
player=c(NA,"Ed",NA,"Ed","Ed",NA),
playerGame= c(NA,1,NA,2,3,NA),
teamGame =c(1,2,3,4,5,6))
where the NAs indicate that the player did not appear in that specific team game.
How would I most efficiently replace the team and player NAs with "CRP" and "Ed" respectively, and get a playerGame output of, in this instance, 0,1,1,2,3,3?
EDIT
Sorry, I wrote this when I woke up in the middle of the night and may have over-simplified my problem. Only one person seems to have picked up on the fact that this is a subset of a much larger set of data, and even he/she did not follow it through to the point that a straight hardcoded replacement of player and team is insufficient. Thanks for the replies. Dsee's hint about na.locf in the zoo package and the first line of AK's answer appear to offer the best way forward:
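library(zoo)  # provides na.locf()
# A missed first team game means zero appearances so far
df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame)] <- 0
df$playerGame <- na.locf(df$playerGame)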
This covers the case of more than one NA at the start of the sequence. In my case min(df$teamGame) will always be 1, so hardcoding that may speed things up.
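As a cross-check on the toy data, here is a rough base-R sketch that gets the same playerGame values without zoo. It rests on an assumption that holds in my example but may not hold everywhere: playerGame is simply a running count of the player's appearances, so counting the non-NA entries seen so far gives the filled-in value. The name `filled` is only for illustration.

```r
# Toy data from the question: NA marks team games the player missed.
df <- data.frame(team = c(NA, "CRP", NA, "CRP", "CRP", NA),
                 player = c(NA, "Ed", NA, "Ed", "Ed", NA),
                 playerGame = c(NA, 1, NA, 2, 3, NA),
                 teamGame = c(1, 2, 3, 4, 5, 6),
                 stringsAsFactors = FALSE)

# Assumption: playerGame is a running count of appearances, so the
# number of non-NA entries seen so far equals the filled-in value.
filled <- cumsum(!is.na(df$playerGame))
filled
# [1] 0 1 1 2 3 3
```

If playerGame ever skipped numbers or restarted within a block, this shortcut would give different results from the na.locf version.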
Here is a more realistic example:
library(zoo)
library(plyr)
newdf <- data.frame(team=c("CRP","CRP","CRP","CRP","CRP","CRP","TOT","TOT","TOT"),
player=c(NA,"Ed",NA,"Bill","Bill",NA,NA,NA,"Tom"),
playerGame= c(NA,1,NA,1,2,NA,NA,NA,1),
teamGame =c(1,2,3,1,2,3,1,2,3))
I can now show the team for every row. Each team plays three games in a season. Ed and Bill play for CRP and appear in game 2 and in games 1 and 2 respectively. Tom plays for TOT in game 3 only. Assume that player names are unique (even in real-world data).
It seems to me that I need to create another column, 'playerTeam'
# Every block of three consecutive rows is one player/team combination
newdf$playerTeam <- ceiling(seq_len(nrow(newdf)) / 3)
I can then use this value to fill in the player gaps. I have used the sort function, which omits NA:
newdf <- ddply(newdf,.(playerTeam),transform,player=sort(player)[1])
I can then use the aforementioned code
newdf$playerGame[newdf$teamGame == 1 & is.na(newdf$playerGame)] <- 0
newdf$playerGame <- na.locf(newdf$playerGame)
  team player playerGame teamGame playerTeam
1  CRP     Ed          0        1          1
2  CRP     Ed          1        2          1
3  CRP     Ed          1        3          1
4  CRP   Bill          1        1          2
5  CRP   Bill          2        2          2
6  CRP   Bill          2        3          2
7  TOT    Tom          0        1          3
8  TOT    Tom          0        2          3
9  TOT    Tom          1        3          3
I will need to build in season as well, but that should not be a problem.
Am I missing anything here?
I have several hundred thousand rows to process, so any speed-ups would be helpful. For instance, I would probably want to avoid ddply and use a data.table approach or another apply function, right?
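To make the "another apply function" idea concrete, here is a rough base-R sketch of the whole fill using ave() in place of ddply and na.locf. It is not benchmarked. It assumes, as in my example, that every player/team block is exactly three consecutive rows, that each block has at least one non-NA player, and that playerGame is a running count of appearances. I also set stringsAsFactors = FALSE so that player is a character vector rather than a factor.

```r
# Example data from above, with player kept as character.
newdf <- data.frame(team = c("CRP", "CRP", "CRP", "CRP", "CRP", "CRP", "TOT", "TOT", "TOT"),
                    player = c(NA, "Ed", NA, "Bill", "Bill", NA, NA, NA, "Tom"),
                    playerGame = c(NA, 1, NA, 1, 2, NA, NA, NA, 1),
                    teamGame = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                    stringsAsFactors = FALSE)

# Assumption: each block of three consecutive rows is one player/team.
newdf$playerTeam <- ceiling(seq_len(nrow(newdf)) / 3)

# Within each block, spread the first non-NA player name over all rows.
newdf$player <- ave(newdf$player, newdf$playerTeam,
                    FUN = function(x) rep(x[!is.na(x)][1], length(x)))

# Within each block, count appearances so far (assumes playerGame is a
# running count, so this matches the zero-fill plus na.locf result).
newdf$playerGame <- ave(newdf$playerGame, newdf$playerTeam,
                        FUN = function(x) cumsum(!is.na(x)))
newdf
```

On the nine example rows this reproduces the table above. Because the counting is done per block, it does not rely on the first team game of each block having been set to 0 beforehand.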