8

I need to replace the missing values in the field "steps" by the median of "steps" calculated over that particular day (group by "date") with NA values removed. I have already referred this thread but my NA values aren't replaced. Can somebody help me find out where am I going wrong? I would prefer using base package/data table/plyr. The dataset looks approx. like this:-

        steps      date interval
    1:    NA 2012-10-01        0
    2:    NA 2012-10-01        5
    3:    NA 2012-10-01       10
    4:    NA 2012-10-01       15
    5:    NA 2012-10-01       20
   ---                          
17564:    NA 2012-11-30     2335
17565:    NA 2012-11-30     2340
17566:    NA 2012-11-30     2345
17567:    NA 2012-11-30     2350
17568:    NA 2012-11-30     2355

The structure and summary of the dataset(activity) are as shown below

 #str(activity)  
 Classes ‘data.table’ and 'data.frame': 17568 obs. of  3 variables:
     $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
     $ date    : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ...
     $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

#summary(activity)
         steps             date               interval     
     Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
     1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
     Median :  0.00   Median :2012-10-31   Median :1177.5  
     Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
     3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
     Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
     NA's   :2304     

Things I have tried:

Datatable method:

activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date]
summary(activityrepNA)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304 

Using ave

activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)]
> summary(activity)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304

Attempt at ddply

cleandatapls<-ddply(activity, 
+       .(as.character(date)), 
+       transform, 
+       steps=ifelse(is.na(steps), median(steps, na.rm=TRUE), steps))
> summary(cleandatapls)
 as.character(date)     steps             date               interval     
 Length:17568       Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 Class :character   1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Mode  :character   Median :  0.00   Median :2012-10-31   Median :1177.5  
                    Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
                    3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
                    Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
                    NA's   :2304   

Aggregate for calculating median

whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE)
> summary(whynoclean)
    Group.1               steps        date               interval   
 Min.   :2012-10-01   Min.   :0   Min.   :2012-10-01   Min.   :1178  
 1st Qu.:2012-10-16   1st Qu.:0   1st Qu.:2012-10-16   1st Qu.:1178  
 Median :2012-10-31   Median :0   Median :2012-10-31   Median :1178  
 Mean   :2012-10-31   Mean   :0   Mean   :2012-10-31   Mean   :1178  
 3rd Qu.:2012-11-15   3rd Qu.:0   3rd Qu.:2012-11-15   3rd Qu.:1178  
 Max.   :2012-11-30   Max.   :0   Max.   :2012-11-30   Max.   :1178  
                      NA's   :8                     

EDIT output for the code using mutate

activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))
Source: local data table [17,568 x 3]

   steps       date interval
1     NA 2012-10-01        0
2     NA 2012-10-01        5
3     NA 2012-10-01       10
4     NA 2012-10-01       15
5     NA 2012-10-01       20
6     NA 2012-10-01       25
7     NA 2012-10-01       30
8     NA 2012-10-01       35
9     NA 2012-10-01       40
10    NA 2012-10-01       45
..   ...        ...      ... 

UPDATE:

Steven Beaupre helped me realize that my approach for the imputing was flawed as there were specific dates having only NA values that were causing the problem as the median of NA's is NA. Used another suggested approach.

Community
  • 1
  • 1
Meeshu
  • 95
  • 8

1 Answers1

4

Try:

library(dplyr)
df %>% 
  group_by(date) %>% 
  mutate(steps = ifelse(is.na(steps), median(steps, na.rm = T), steps))

If for a given date, all steps are NAs, you could replace them with 0:

df %>% 
  group_by(date) %>% 
  mutate(steps = ifelse(all(is.na(steps)), 0,
                        ifelse(is.na(steps), median(steps, na.rm = T), steps)))
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
  • Thanks! Do I need to load a library? It is saying could not find function %>% – Meeshu May 16 '15 at 15:55
  • Try `install.packages("dplyr"); library(dplyr)` – Steven Beaupré May 16 '15 at 15:56
  • Hi, thanks again but i need to replace it with median of steps (calculated after removing the NA values) and not median of interval. I tried the code by replacing interval with steps and the NA still remains. Editing the question to specify it – Meeshu May 16 '15 at 16:11
  • This doesn't work: `df %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))` ? – Steven Beaupré May 16 '15 at 16:12
  • Nope :( Adding the output in the question for your reference! – Meeshu May 16 '15 at 16:14
  • Same output as ifelse. I have a feeling I am doing something fundamentally wrong :( Just to be sure I am not goofing this up. I should use "activity" (original datatable with NA) instead of "df" right? – Meeshu May 16 '15 at 16:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77973/discussion-between-steven-beaupre-and-meeshu). – Steven Beaupré May 16 '15 at 16:20