I have a data frame with 7,000 observations and 196 variables, with NAs sprinkled throughout. I created a function to capture grouped means for each numeric variable from the data frame (187 numeric variables, 11 groups). I am now trying to replace the NAs with the appropriate variable grouped mean if the observation is part of a group.
Basically I'm looking to find the NAs in the frame and replace with the appropriate group mean variable.
If df[6501,174]
is group 7 & NA, then replace with mean value of group 7's variable 174.
This is the smallest of the data frames I'm working with, and I'm concerned about efficiency.
The historical time series data is as follows:
str(HD_filtered)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 7032 obs. of 196 variables:
$ Date: Factor w/ 87 levels "12/31/1993","03/31/1994",..: 1 2 2 2 2 2 2 2 2 2 ...
$ V2: Factor w/ 1065 levels "","000361105",..: 246 183 312 31 80 87 132 124 121 211 ...
$ V3: Factor w/ 744 levels "A S V","A V",..: 326 231 22 41 106 113 170 160 157 272 ...
$ V4: Factor w/ 7 levels "BHS","BMU","CAN",..: 7 7 7 7 7 7 7 7 7 7 ...
$ V5: Factor w/ 68 levels "I2",..: 48 16 17 28 11 10 38 28 11 13 ...
$ V6: Factor w/ 1 level "C": 1 1 1 1 1 1 1 1 1 1 ...
$ V7: Factor w/ 11 levels "S1",..: 7 4 9 1 6 8 8 1 6 6 ...
$ V8: Factor w/ 146 levels "SI1",..: 8 77 57 51 16 91 93 49 31 22 ...
$ V9: Factor w/ 1259 levels "","3HCKT","3RVTL",..: 261 23 294 26 82 95 111 1
$ V10: num 0.429 7.4 5 7.75 12 ...
$ V11: num 0.839 2.117 0.97 1.237 1.934 ...
$ V12: num NA -0.176 0.262 0.012 0.146 ...
$ V12: num NA NA NA NA NA NA NA NA NA NA ...
$ V13: num NA NA NA NA NA NA NA NA NA NA ...
$ V196: num NA .045 .62 .034 NA NA NA .012 .03 NA
I created a function to calculate means for V10:V196 based on groups (Date, V4, V5, V7, V8) using dplyr.
Summary_Stats_Function <- function(hd, cmn) {
hd %>%
group_by_(.dots = cmn) %>%
summarise_each(funs(min, max, median, mean(., trim = 0.01, na.rm = TRUE), sd(., na.rm = TRUE)), V10:V196)
}
Universal_Summary_Stats_byV4 <- Summary_Stats_Function(HD_filtered, "V4")
Which gives summary stats:
str(U_sector_summ_stats)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 11 obs. of 936 variables:
$ V4: Factor w/ 11 levels "S1",..: 1 2 3 4 5 6 7 8 9 10 ...
$ V10_min: num 0 0 0 0 0 0 0 0 0 0.5 ...
$ V11_min: num -1.0216 -1.8599 0.0501 -0.5723 NA ...
$V196_min: num -0.984 -0.815 -0.848 -0.981 -0.549 ...
$V393_mean: num 4.087 2.716 5.116 2.813 0.589...
$V588_mean: num NA NA NA NA NA ...
$V936_sd: num 107 103 120 103 129 ...
replace_with <- select(Universal_Summary_Stats_byV4, contains("_mean")
I'm trying to figure out how to take the mean results of this function held in replace_with
and put back into HD_filtered such that the NAs are replaced with the appropriate group mean.
I have tried using 'for' loops and 'apply' functionality without success, and am probably getting hung up on logical syntax?