This is my first Stack Overflow post. I researched extensively but have not found a similar post.
I am trying to impute the median for NA values based on two conditions.
Here is my code:
#Create sample of original data for reproducibility
Date<-c("2009-05-01","2009-05-02","2009-05-03","2009-06-01","2009-06-02",
"2009-06-03", "2010-05-01","2010-05-02","2010-05-03","2010-06-01",
"2010-06-02","2010-06-03","2011-05-01","2011-05-02","2011-05-03",
"2011-06-01","2011-06-02","2011-06-03")
Month<- c("May","May","May","June","June","June",
"May","May","May","June","June","June",
"May","May","May","June","June","June")
DayType<- c("Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
"Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
"Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday")
Qty<- c(NA,NA,NA,NA,NA,NA,
1,2,1,10,15,13,
3,2,5,20,14,16)
#Combine into dataframe
Example<-data.frame(Date,Month,DayType,Qty)
#Test output
Example
# Make a separate dataframe to calculate the median value based on day of the month
test1 <- ddply(Example,. (DayType,Month),summarize,median=median(Qty,na.rm=TRUE))
This works as expected. Test1 output looks like this:
DayType Month Median
Monday June 15.0
Monday May 2.0
Tuesday June 14.5
Tuesday May 2.0
Wednesday June 14.5
Wednesday May 3.0
My second step replaces "NA" values in the original dataset with the medians calculated in test1. This is where my issue comes in.
Example$Qty[is.na(Example$Qty)] <- test1$median[match(Example$DayType,test1$DayType,Example$Month,test1$Month)][is.na(Example$Qty)]
Example
Match[] only matches on the median value for each day, rather than the median value for each day by month. The output is the same seven repeating values for the entire set. I have not figured out how to match on both columns simultaneously.
Output:
Date DayType Month GSEvtQty
2009-05-01 Monday May 15.0 *should be 2.0, matching to June
2009-05-02 Tuesday May 14.5 *should be 2.0, matching to June
2009-05-03 Wednesday May 14.5 *should be 3.0, matching to June
2009-06-01 Monday June 15.0 *imputes correctly
2009-06-02 Tuesday June 14.5 *imputes correctly
2009-06-03 Wednesday June 14.5 *imputes correctly
2010-05-01 Monday May 1.0
2010-05-02 Tuesday May 2.0
2010-05-03 Wednesday May 1.0
2010-06-01 Monday June 10.0
2010-06-02 Tuesday June 15.0
2010-06-03 Wednesday June 13.0
I have also tried using %in%:
Example$Qty[is.na(Example$Qty)] <- test1$median[Example$DayType %in% test1$DayType & Example$Month %in% test1$Month][is.na(Example$Qty)]
But that does not match correctly and only outputs a limited number of values rather than over the entire series of NAs.
Using na.aggregate via the Zoo package as cleverly suggested by @Jaap:
setDT(Example)[, Value := na.aggregate("Qty", FUN = median), by = c("DayType","Month")]
For some reason does not transform the NAs:
Output:
Date Month DayType Qty
2009-05-01 May Monday NA
2009-05-02 May Tuesday NA
2009-05-03 May Wednesday NA
2009-06-01 June Monday NA
Any suggestions would be greatly appreciated! Thanks for sticking with this post for so long and look forward to paying the assistance forward in the future.