0

I have a data frame with 7,000 observations and 196 variables, with NAs sprinkled throughout. I created a function to capture grouped means for each numeric variable from the data frame (187 numeric variables, 11 groups). I am now trying to replace the NAs with the appropriate variable grouped mean if the observation is part of a group.

Basically I'm looking to find the NAs in the frame and replace with the appropriate group mean variable.

If df[6501,174] is group 7 & NA, then replace with mean value of group 7's variable 174.

This is the smallest of the data frames I'm working with, and I'm concerned about efficiency.

The historical time series data is as follows:

str(HD_filtered)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   7032 obs. of  196 variables:
 $ Date: Factor w/ 87    levels "12/31/1993","03/31/1994",..: 1 2 2 2 2 2 2 2 2 2 ...
 $ V2: Factor w/ 1065 levels "","000361105",..: 246 183 312 31 80 87 132 124 121 211 ...
 $ V3: Factor w/ 744 levels "A S V","A V",..: 326 231 22 41 106 113 170 160 157 272 ...
 $ V4: Factor w/ 7 levels "BHS","BMU","CAN",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ V5: Factor w/ 68 levels "I2",..: 48 16 17 28 11 10 38 28 11 13 ...
 $ V6: Factor w/ 1 level "C": 1 1 1 1 1 1 1 1 1 1 ...
 $ V7: Factor w/ 11 levels "S1",..: 7 4 9 1 6 8 8 1 6 6 ...
 $ V8: Factor w/ 146 levels "SI1",..: 8 77 57 51 16 91 93 49 31 22 ...
 $ V9: Factor w/ 1259 levels "","3HCKT","3RVTL",..: 261 23 294 26 82 95 111 1
 $ V10: num  0.429 7.4 5 7.75 12 ...
 $ V11: num  0.839 2.117 0.97 1.237 1.934 ...
 $ V12: num  NA -0.176 0.262 0.012 0.146 ...
 $ V12: num  NA NA NA NA NA NA NA NA NA NA ...
 $ V13: num  NA NA NA NA NA NA NA NA NA NA ...
 $ V196: num NA .045 .62 .034 NA NA NA .012 .03 NA

I created a function to calculate means for V10:V196 based on groups (Date, V4, V5, V7, V8) using dplyr.

Summary_Stats_Function <- function(hd, cmn) {
  hd %>%
    group_by_(.dots = cmn) %>%
    summarise_each(funs(min, max, median, mean(., trim = 0.01, na.rm = TRUE), sd(., na.rm = TRUE)), V10:V196)
}

Universal_Summary_Stats_byV4 <- Summary_Stats_Function(HD_filtered, "V4")

Which gives summary stats:

str(U_sector_summ_stats)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   11 obs. of  936 variables:
 $ V4: Factor w/ 11 levels "S1",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ V10_min: num  0 0 0 0 0 0 0 0 0 0.5 ...
 $ V11_min: num  -1.0216 -1.8599 0.0501 -0.5723 NA ...
 $V196_min: num  -0.984 -0.815 -0.848 -0.981 -0.549 ...
 $V393_mean: num  4.087 2.716 5.116 2.813 0.589...
 $V588_mean: num  NA NA NA NA NA ...
 $V936_sd: num  107 103 120 103 129 ...

replace_with <- select(Universal_Summary_Stats_byV4, contains("_mean")

I'm trying to figure out how to take the mean results of this function held in replace_with and put back into HD_filtered such that the NAs are replaced with the appropriate group mean.

I have tried using 'for' loops and 'apply' functionality without success, and am probably getting hung up on logical syntax?

ekad
  • 14,436
  • 26
  • 44
  • 46
ACMgo
  • 31
  • 4
  • 1
    see [my answer overhere](http://stackoverflow.com/questions/32694313/handle-continous-missing-values-in-time-series-data/32694667#32694667) for some inspiration – Jaap Sep 23 '15 at 14:32
  • 1
    Furthermore: Welcome to StackOverflow & please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to produce a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). This will make it much easier for others to help you. – Jaap Sep 23 '15 at 14:35
  • 1
    `?zoo::na.aggregate` might help, but without sample data this is pure speculation. Please provide a reproducible example. – RHertel Sep 23 '15 at 14:36
  • 1
    This would be very simple with the `ave` function, but it makes dubious sense from a statistical point of view. Can you explain the goals of htis effort so a better strategy can be offered? – IRTFM Sep 23 '15 at 14:55
  • thanks. the zoo package looks to be a promising new direction. – ACMgo Sep 23 '15 at 20:13

1 Answers1

0

Maybe not an elegant solution, but here is a base R solution using merge() of data frames of grouped means and original data frame within nested for loops.

First, since you only want means, run your summarise_each() with only means to get an output of V10_mean - V196_mean.

Summary_Stats_Function <- function(hd, cmn) {
    hd %>%
    group_by_(.dots = cmn) %>%
    summarise_each(funs(mean(., trim = 0.01, na.rm = TRUE)), V10:V196)
}        

Then run nested for loops calling above function at group level and merging data frames in outer loop:

# ITERATE THROUGH EACH GROUP (ASSUMING MUTUALLY EXCLUSIVE)
for (grp in c("V4", "V5", "V7", "V8")) {

    replace_with <- Summary_Stats_Function(HD_filtered, grp)  

    mergedf <- merge(HD_filtered, replace_with, by=grp)

    # ITERATE THROUGH EACH NUMERIC COLUMN
    for (i in 10:196) {    
         mergedf[[i]][is.na(mergedf[[i]])] <- 
                mergedf[[paste0("V", i,"_mean")]][is.na(mergedf[[i]])]
    }
}
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Thanks, Parfait. I couldn't quite get that to work, but I think you're right with inelegant iterative nested for loops as the eventual result. That said I think I've successfully made things too complicated at the outset(e.g. a variety of stats in `Summary_Stats_Function` and a convoluted question), so I'm going to reorder the flow and see if that helps. Will update this if possible. – ACMgo Sep 25 '15 at 22:22
  • Curious though, what error do you receive? Do `_mean` columns in `replace_with` not exist? Does the merge `by` variable fail? Wrong data types (factors vs numeric)? – Parfait Sep 26 '15 at 03:25
  • Well, two things. One, running `summarise_each()` with only means was the right idea, but then I do not need to use `select()` in `replace_with`. Two, the error is shown as: `Error in fix.by(by.y, y) : 'by' must specify uniquely valid columns`. – ACMgo Sep 29 '15 at 18:52
  • Indeed. Both may be related as filtering for only `_means` dropped the group variable (V4, V5, V7, V8) which is later used in merge. I removed the `select()`. – Parfait Sep 29 '15 at 19:03
  • Fixed the `by` error. Your solution suggests iterating `grp`, but my convoluted question should have asked for a solution to iterating through the levels of say factor `V4`. So pick a factor with levels (e.g. `V4` with 11 levels, and replace NAs with the mean of that factor level if the observation belongs in that level/group. – ACMgo Sep 29 '15 at 19:20
  • Can you check `replace_with` results? Are means broken down by factor levels of the group? – Parfait Sep 30 '15 at 00:42
  • Yes, `replace_with` looks as follows: – ACMgo Sep 30 '15 at 18:16
  • Yes, `replace_with` looks as noted above `str(U_sector_summ_stats`, with the exception that there are 188 variables, not 936. The first variable is the factor level grouping (e.g. factor `V4`) and the rest are means for each level grouping per each of the other factors (10:196 of `HD_filtered`). – ACMgo Sep 30 '15 at 18:23
  • Correct. So why does our merge solution not work? Iterating through `grp` merges the means by each observation's factor levels. Specifically on first iteration, `V4` is the merged `by` variable and `V10_mean` holds `V10` means for each `V4` factor level. Then, we fill in `V10` NAs with `V10_mean`. Check data output to verify. – Parfait Sep 30 '15 at 19:56