17

Example data:

set.seed(1)
df <- data.frame(years=sort(rep(2005:2010, 12)), 
                 months=1:12, 
                 value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))

head(df)
  years months      value
1  2005      1 -0.6264538
2  2005      2  0.1836433
3  2005      3 -0.8356286
4  2005      4  1.5952808
5  2005      5  0.3295078
6  2005      6 -0.8204684

Tell me please, how i can replace NA in df$value to median of others months? "value" must contain the median of value of all previous values for the same month. That is, if current month is May, "value" must contain the median value for all previous values of the month of May.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Sheridan
  • 615
  • 1
  • 8
  • 21

6 Answers6

16

you want to use the test is.na function:

df$value[is.na(df$value)] <- median(df$value, na.rm=TRUE)

which says for all the values where df$value is NA, replace it with the right hand side. You need the na.rm=TRUE piece or else the median function will return NA

to do this month by month, there are many choices, but i think plyr has the simplest syntax:

library(plyr)
ddply(df, 
      .(months), 
      transform, 
      value=ifelse(is.na(value), median(value, na.rm=TRUE), value))

you can also use data.table. this is an especially good choice if your data is large:

library(data.table)
DT <- data.table(df)
setkey(DT, months)

DT[,value := ifelse(is.na(value), median(value, na.rm=TRUE), value), by=months]

There are many other ways, but there are two!

Justin
  • 42,475
  • 9
  • 93
  • 111
  • +1 for the explanation. I don't use `plyr` much, so I'm just curious, what's the main difference between `transform` (which you used) and `summarize` which Sacha used? – A5C1D2H2I1M1N2O1R2T1 Aug 15 '12 at 15:41
  • 1
    `transform` is to alter or add a column to an existing `data.frame`. as in it will return the whole data frame given plus any new rows you added. `summarise` returns a "summary" like average per month or something and only returns the rows specified. – Justin Aug 15 '12 at 15:42
  • Nice, didn't know about `transform`. I thought there should be a way to do it in one line with `plyr`. – Sacha Epskamp Aug 15 '12 at 16:24
  • Similar question asked here: http://stackoverflow.com/questions/9322773/how-to-replace-na-with-mean-by-subset-in-r-impute-with-plyr/9322975#9322975 but with mean – Tyler Rinker Aug 15 '12 at 16:55
10

Or with ave

df <- data.frame(years=sort(rep(2005:2010, 12)),
months=1:12,
value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
df$value[is.na(df$value)] <- with(df, ave(value, months, 
   FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)]

Since there are so many answers let's see which is fastest.

plyr2 <- function(df){
  medDF <- ddply(df,.(months),summarize,median=median(value,na.rm=TRUE))
df$value[is.na(df$value)] <- medDF$median[match(df$months,medDF$months)][is.na(df$value)]
  df
}
library(plyr)
library(data.table)
DT <- data.table(df)
setkey(DT, months)


benchmark(ave = df$value[is.na(df$value)] <- 
  with(df, ave(value, months, 
               FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)],
          tapply = df$value[61:72] <- 
            with(df, tapply(value, months, median, na.rm=TRUE)),
          sapply = df[61:72, 3] <- sapply(split(df[1:60, 3], df[1:60, 2]), median),
          plyr = ddply(df, .(months), transform, 
                       value=ifelse(is.na(value), median(value, na.rm=TRUE), value)),
          plyr2 = plyr2(df),
          data.table = DT[,value := ifelse(is.na(value), median(value, na.rm=TRUE), value), by=months],
          order = "elapsed")
        test replications elapsed relative user.self sys.self user.child sys.child
3     sapply          100   0.209 1.000000     0.196    0.000          0         0
1        ave          100   0.260 1.244019     0.244    0.000          0         0
6 data.table          100   0.271 1.296651     0.264    0.000          0         0
2     tapply          100   0.271 1.296651     0.256    0.000          0         0
5      plyr2          100   1.675 8.014354     1.612    0.004          0         0
4       plyr          100   2.075 9.928230     2.004    0.000          0         0

I would have bet that data.table was the fastest.

[ Matthew Dowle ] The task being timed here takes at most 0.02 seconds (2.075/100). data.table considers that insignificant. Try setting replications to 1 and increasing the data size, instead. Or timing the fastest of 3 runs is also a common rule of thumb. More verbose discussion in these links :

Community
  • 1
  • 1
Luciano Selzer
  • 9,806
  • 3
  • 42
  • 40
  • 3
    +1 very clearly done. `data.table` really shines once data gets big and/or the grouping variable has lots of levels. with a different data set, all your timings would be significantly different. – Justin Aug 15 '12 at 15:48
  • How is `ave` really different from `tapply`? Is it just `tapply` with `mean` as default and slightly different syntax? – Sacha Epskamp Aug 15 '12 at 16:47
  • 2
    @SachaEpskamp The main difference is in the returned value. `ave` will return a vector of the same lenght as the `df` in this case while `tapply` will return an vector of lenght `unique(months)`. It's just a matter of what output is more convinient for you. – Luciano Selzer Aug 15 '12 at 16:56
9

There is another way to do this with dplyr.

If you want to replace all columns with their median, do:

library(dplyr)
df %>% 
   mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))

If you want to replace a subset of columns (such as "value" in OP's example), do:

df %>% 
  mutate_at(vars(value), ~ifelse(is.na(.), median(., na.rm = TRUE), .))
Sam H.
  • 91
  • 1
  • 1
4

Here's the most robust solution I can think of. It ensures the years are ordered correctly and will correctly compute the median for all previous months in cases where you have multiple years with missing values.

# first, reshape your data so it is years by months:
library(reshape2)
tmp <- dcast(years ~ months, data=df)  # convert data to years x months
tmp <- tmp[order(tmp$years),]          # order years
# now calculate the running median on each month
library(caTools)
# function to replace NA with rolling median
tmpfun <- function(x) {
  ifelse(is.na(x), runquantile(x, k=length(x), probs=0.5, align="right"), x)
}
# apply tmpfun to each column and convert back to data.frame
tmpmed <- as.data.frame(lapply(tmp, tmpfun))
# reshape back to long and convert 'months' back to integer
res <- melt(tmpmed, "years", variable.name="months")
res$months <- as.integer(gsub("^X","",res$months))
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
3

Sticking with base R, you can also try the following:

medians = sapply(split(df[1:60, 3], df[1:60, 2]), median)
df[61:72, 3] = medians
woliveirajr
  • 9,433
  • 1
  • 39
  • 49
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • This only works if exactly only rows 61 - 72 contain `NA`, which is likely not the case in OP's full dataset. – Sacha Epskamp Aug 15 '12 at 16:23
  • @SachaEpskamp, and hence a downvote? Sorry, but I don't see what else you expect. Does your solution provide a rolling median for more than one year of missing data? If so, again, I'm not a regular `plyr` user, so please update your answer with a worked example. – A5C1D2H2I1M1N2O1R2T1 Aug 15 '12 at 16:36
  • 1
    Sorry, was unnecessary indeed, but can't fix it. I spend too much time on Reddit down-voting things it becomes automatic :) As for `plyr`, Justins answer is much better. – Sacha Epskamp Aug 15 '12 at 16:44
  • 1
    @SachaEpskamp -- Here, I'll upvote to fix that for you. Cheers. – Josh O'Brien Aug 15 '12 at 17:00
  • 1
    @woliveirajr, ha. I looked at the edit before I saw your comment, and thought, "what a pointless edit" :) – GSee Aug 15 '12 at 18:12
  • @GSee, oh no. Now I've gone and created a mess. Josh, woliveirajr, thanks--summer of love in action, I see :-) – A5C1D2H2I1M1N2O1R2T1 Aug 15 '12 at 18:33
1

This is a way using plyr, it is not very pretty but I think it does what you want:

library("plyr")

# Make a separate dataframe with month as first column and median as second:
medDF <- ddply(df,.(months),summarize,median=median(value,na.rm=TRUE))

# Replace `NA` values in `df$value` with medians from the second data frame
# match() here ensures that the medians are entered in the correct elements.
df$value[is.na(df$value)] <- medDF$median[match(df$months,medDF$months)][is.na(df$value)]
Sacha Epskamp
  • 46,463
  • 20
  • 113
  • 131