0

This is my first Stack Overflow post. I researched extensively but have not found a similar post.

I am trying to impute the median for NA values based on two conditions.

Here is my code:

#Create sample of original data for reproducibility
Date<-c("2009-05-01","2009-05-02","2009-05-03","2009-06-01","2009-06-02",
        "2009-06-03", "2010-05-01","2010-05-02","2010-05-03","2010-06-01",
        "2010-06-02","2010-06-03","2011-05-01","2011-05-02","2011-05-03",
        "2011-06-01","2011-06-02","2011-06-03")
Month<- c("May","May","May","June","June","June",
          "May","May","May","June","June","June",
          "May","May","May","June","June","June")
DayType<- c("Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
            "Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
            "Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday")
Qty<- c(NA,NA,NA,NA,NA,NA,
        1,2,1,10,15,13,
        3,2,5,20,14,16)

#Combine into dataframe
Example<-data.frame(Date,Month,DayType,Qty)

#Test output
Example

# Make a separate dataframe to calculate the median value based on day of the month
test1 <- ddply(Example,. (DayType,Month),summarize,median=median(Qty,na.rm=TRUE))

This works as expected. Test1 output looks like this:

DayType   Month  Median
Monday    June   15.0
Monday    May    2.0
Tuesday   June   14.5
Tuesday   May    2.0
Wednesday June   14.5
Wednesday May    3.0

My second step replaces "NA" values in the original dataset with the medians calculated in test1. This is where my issue comes in.

Example$Qty[is.na(Example$Qty)] <- test1$median[match(Example$DayType,test1$DayType,Example$Month,test1$Month)][is.na(Example$Qty)]

Example

Match[] only matches on the median value for each day, rather than the median value for each day by month. The output is the same seven repeating values for the entire set. I have not figured out how to match on both columns simultaneously.

Output:
Date         DayType   Month   GSEvtQty
2009-05-01   Monday    May     15.0    *should be 2.0, matching to June
2009-05-02   Tuesday   May     14.5    *should be 2.0, matching to June
2009-05-03   Wednesday May     14.5    *should be 3.0, matching to June
2009-06-01   Monday    June    15.0    *imputes correctly
2009-06-02   Tuesday   June    14.5    *imputes correctly
2009-06-03   Wednesday June    14.5    *imputes correctly
2010-05-01   Monday    May     1.0     
2010-05-02   Tuesday   May     2.0  
2010-05-03   Wednesday May     1.0 
2010-06-01   Monday    June    10.0
2010-06-02   Tuesday   June    15.0  
2010-06-03   Wednesday June    13.0   

I have also tried using %in%:

Example$Qty[is.na(Example$Qty)] <- test1$median[Example$DayType %in% test1$DayType & Example$Month %in% test1$Month][is.na(Example$Qty)]

But that does not match correctly and only outputs a limited number of values rather than over the entire series of NAs.

Using na.aggregate via the Zoo package as cleverly suggested by @Jaap:

setDT(Example)[, Value := na.aggregate("Qty", FUN = median), by = c("DayType","Month")]

For some reason does not transform the NAs:

Output:
 Date         Month   DayType   Qty
 2009-05-01   May     Monday    NA
 2009-05-02   May     Tuesday   NA
 2009-05-03   May     Wednesday NA
 2009-06-01   June    Monday    NA

Any suggestions would be greatly appreciated! Thanks for sticking with this post for so long and look forward to paying the assistance forward in the future.

mk99
  • 1
  • 2
  • You could use `na.aggregate` from the `zoo`-package; [an example](https://stackoverflow.com/q/32694313/2204410). – Jaap Jul 04 '18 at 13:42
  • Related / possible duplicate: [*Replacing Missing Value in R*](https://stackoverflow.com/q/37919422/2204410) – Jaap Jul 04 '18 at 13:47
  • Great idea! I was unaware of that function. Thank you, @Jaap. I used variations of the below (derived from your excellent link), but still receive all NAs. Any idea what I am missing? setDT(info)[, Value := na.aggregate("GSEvtQty", FUN = median), by =c("DayType","Month")] – mk99 Jul 04 '18 at 15:49
  • Could you include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) in your question? – Jaap Jul 04 '18 at 16:15
  • Sure thing. Just updated. Thanks for your patience as I learn the ropes. Look forward to being easier to work with in the future! – mk99 Jul 04 '18 at 19:47
  • It isn't working because you quote `Qty` in `na.aggregate`. Working code: `setDT(Example)[, Value := na.aggregate(Qty, FUN = median), by = .(DayType, Month)][]` – Jaap Jul 04 '18 at 20:09
  • And a `dplyr` alternative: `Example %>% group_by(DayType, Month) %>% mutate(Qty = zoo::na.aggregate(Qty, median))` – Jaap Jul 04 '18 at 20:10
  • Both of these work great. Learned a lot. Thanks so much! – mk99 Jul 05 '18 at 11:36

1 Answers1

0

This is what merge was created for.

info$GSEvtQty[is.na(info$GSEvtQty)]<- merge(info[is.na(info$GSEvtQty,)], test1, by=c("DayType", "Month"))[,"GSEvtQty"]
iod
  • 7,412
  • 2
  • 17
  • 36
  • Go figure. Thank you, @doviod! The above outputted an error saying "undefined columns selected" so I modified it to the below (added a comma): `info$GSEvtQty[is.na(info$GSEvtQty)]<- merge(info[is.na(info$GSEvtQty),], test1, by=c("DayType", "Month"))[,"GSEvtQty"]` That allows it to run through, but it still outputs all NAs. Any thoughts? Again, really appreciate you and @Jaap. Know just enough R to be dangerous, but still have lots left to learn! – mk99 Jul 04 '18 at 16:39