replace NA in an R vector with adjacent values

Question

I have a dataframe which has merged player and team data for soccer seasons So for a particular player in a specific season I have data like

df <- data.frame(team=c(NA,"CRP",NA,"CRP","CRP",NA),
             player=c(NA,"Ed",NA,"Ed","Ed",NA),
             playerGame= c(NA,1,NA,2,3,NA),
             teamGame =c(1,2,3,4,5,6))

Where the NA's indicate that the player did not appear in that specific team game

How would I most efficiently replace the team and player NA's with "CRP" and "Ed" respectively and have a plGame output of, in this instance, 0,1,1,2,3,3

EDIT

Sorry, I wrote this when I woke up in the middle of the night and may have over-simplified my problem too much. Only one person seems to have picked up on the fact that this is a subset of a much larger set of data and even he/she did not follow it though that a straight hardcode replacement of player and team was insufficient Thanks for the replies. Dsee's hint for the na.locf in the zoo package and the first line of AK's answer appears to offer the best way forward

df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
na.locf(df$playerGame)

This covers the eventuality of more than one NA to start the sequence. In my case the min(df$teamGame) will always be 1 so hardcoding that may speed things up

A more realistic example is here

library(zoo)
library(plyr)

newdf <- data.frame(team=c("CRP","CRP","CRP","CRP","CRP","CRP","TOT","TOT","TOT"),
             player=c(NA,"Ed",NA,"Bill","Bill",NA,NA,NA,"Tom"),
             playerGame= c(NA,1,NA,1,2,NA,NA,NA,1),
             teamGame =c(1,2,3,1,2,3,1,2,3))

I can now show the team for every row Each team plays three games in a season. Ed and Bill, play for CRP and appear in games 2 and 1,2 respectively. Tom plays for TOT in game 3 only. Assume that player names are unique(even in real world data)

It seems to me that I need to create another column, 'playerTeam'

newdf$playerTeam <- 0

for (i in 1:nrow(newdf)) {
newdf$playerTeam[i] <-ceiling(i/3)
}

I can then use this value to fill in the player gaps. I have used the sort functiom which omits NA

newdf <- ddply(newdf,.(playerTeam),transform,player=sort(player)[1])

I can then use the aforementioned code

newdf$playerGame[newdf$teamGame == 1 & is.na(newdf$playerGame) == TRUE] <- 0
newdf$playerGame <- na.locf(newdf$playerGame)

   team player playerGame teamGame playerTeam
1  CRP     Ed          0        1          1
2  CRP     Ed          1        2          1
3  CRP     Ed          1        3          1
4  CRP   Bill          1        1          2
5  CRP   Bill          2        2          2
6  CRP   Bill          2        3          2
7  TOT    Tom          0        1          3
8  TOT    Tom          0        2          3
9  TOT    Tom          1        3          3

I will need to build in season as well but that should not be a problem

Am I missing anything here?

I have several hundred thousand rows to process so any speed ups would be helpful. For instance I would probably want to avoid ddply and use a data.table approach or another apply function, right

Since team seems to always be CRP in your example you could set it to CRP with: df$team = 'CRP' but the idea of setting team to the nearest non-missing value is more interesting and I do not know how to do that. I cannot pick up a pattern for playerGame so cannot suggest a good way to create it. Is playerGame cumulative games played? — Mark Miller, Nov 22 '12 at 14:35
How does plGame relate to the rest of the data? How did you calculate your example of `0,1,1,2,3,3`? — Ricardo Saporta, Nov 22 '12 at 14:38
For the first part, see pages 10 and 11 of [An introduction to R](http://cran.r-project.org/doc/manuals/R-intro.pdf). For the `plGame` part, see `library(zoo);?na.locf`, or search this site for something like "[r] NA" — GSee, Nov 22 '12 at 15:53

score 2 · Accepted Answer · edited Jun 13 '14 at 13:10

There seem to be 2 parts to what you want:

You want to replace the player name & team with pre-determined values
You wish to carry forward the counts of the games through the list of playerGame

For (1), you could do:

df$team[is.na(df$team)] <- 'CRP'

Similarly you can alter the other component of the dataframe

For (2) you could do this:

if(is.na(df$playerGame[1])) {
    df$playerGame[1] <- 0
}
for(i in 2:length(df$playerGame)) { 
    if(is.na(x[i])) {
        df$playerGame[i] <- df$playerGame[i-1]
    }
}

then df$playerGame is:

[1] 0 1 1 2 3 3

Perhaps there is a very nifty way to do this, but this is clearly readable...

As no-one seems to be answering my edited version, I have given your answer as accepted as the part 2 is acceptable and forms part of my future work Thanks — pssguy, Nov 23 '12 at 18:54

score 1 · Answer 2 · answered Nov 22 '12 at 14:40

to select the NA's, for say player, use

  df$player[is.na(df$player)]

Then to assign a value to these use

  df$player[is.na(df$player)]  <- "Ed"

If you just want to assign the entire player column the same name, you do not need to select any values:

  df$player[]  <-  "Ed"   # you can omit the brackets [], which are shown just for emphasis

you can then do the same for df$team

On a side note, when you create your data frame, if you plan to add values other than those already there, you will likely want to add stringsAsFactors=FALSE

 data.frame( . , stringsAsFactors=FALSE)

tfr · Answer 3 · 2012-11-22T15:59:04.943

replace NA's for team and player:

df$team[is.na(df$team)] <- "CRP"
df$player[is.na(df$player)] <- "Ed"

There is probably a more efficient way to get the adjacent values, but is works.

If the first or/and the last value is NA, as in your example, I had to use two additional lines:

df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
df$playerGame[df$teamGame == max(df$teamGame) & is.na(df$playerGame) == TRUE] <- max(df$playerGame, na.rm = TRUE)

For all other observations, this get the adjacent values:

df$playerGame[is.na(df$playerGame) == TRUE] <- df$playerGame[-1]

df

team player playerGame teamGame
CRP     Ed          0        1
CRP     Ed          1        2
CRP     Ed          1        3
CRP     Ed          2        4
CRP     Ed          3        5
CRP     Ed          3        6

For more than one team and/or player I would suggest to combine it with ddply (plyr).

Thanks for suggestion. It works for this example but not sure it is general enough for cases where the seies starts with 2 NA's — pssguy, Nov 22 '12 at 20:58

replace NA in an R vector with adjacent values

3 Answers3

Linked