Most efficient way to replace NAs in a data frame based on a subset of other row factors (using median as an estimate) in R

Question

I would like to estimate the missing values of a numeric variable in a data frame using the median of that variable within each combination of other factors. I would then like to replace the NAs in the numeric variable with these estimates.

I have a data frame like this:

Fac1   Fac2   Var1
A      a      20
A      b      30
B      a      5
B      b      10
.
.
.

I have used the `aggregate` function to find the median for each combination of factors:

A a = 22
A b = 28
B a = 12
B b = 8

So any NA's in Var1 would be replaced with the corresponding median based on the combinations of the factors.

I understand that this could be done by replacing the missing values for each subset of the data individually, but that would quickly become tedious with more than two factors. Is there a more efficient way to get this result?

Don't calculate the medians separately. Use `Hmisc::impute` with a split-apply-combine function of your choice. — Roland, Nov 10 '16 at 10:29
To add to Roland's comment, you might want to look into packages such as dplyr or data.table for data manipulation/cleaning. — David, Nov 10 '16 at 10:30

Ronak Shah · Accepted Answer · 2016-11-10T11:46:22.187

1

You haven't provided sample data, but based on your question I think this should work.

As @Roland mentioned, there is no need to calculate the medians separately.

Assume your data frame is called `df`. For every group (here, each combination of Fac1 and Fac2) we calculate the median with the NA values removed. We then select only the positions where Var1 is NA and replace them with their group's median.

df$Var1[is.na(df$Var1)] <- ave(df$Var1, df$Fac1, df$Fac2,
                               FUN = function(x) median(x, na.rm = TRUE))[is.na(df$Var1)]
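To see the replacement in action, here is a minimal sketch on a small made-up data frame. The values are hypothetical (the question did not include sample data), and the helper names `group_median` and `missing` are only there to split the one-liner into two readable steps.

```r
# Hypothetical toy data; the values are invented for illustration only.
df <- data.frame(
  Fac1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Fac2 = c("a", "a", "a", "b", "a", "a", "b", "b"),
  Var1 = c(20, 24, NA, 30, 5, NA, 10, NA)
)

# Group median for every row, computed with the NAs ignored.
group_median <- ave(df$Var1, df$Fac1, df$Fac2,
                    FUN = function(x) median(x, na.rm = TRUE))

# Overwrite only the rows where Var1 is missing.
missing <- is.na(df$Var1)
df$Var1[missing] <- group_median[missing]

print(df)
```

Rows that already had a value are left untouched; only the three NA rows receive their group's median. A group whose values are all NA would still end up as NA, since there is nothing to take a median of.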

UPDATE

At the OP's request, here is some information about the `ave` function.

The first argument to `ave` is the vector you want to operate on. Here that is Var1, for which we want the median. All the unnamed arguments that follow are grouping variables, and there can be any number of them. Here the grouping variables are Fac1 and Fac2. Last comes the function, passed as `FUN`, which is applied to the first argument (Var1) within every group defined by the grouping variables (Fac1 and Fac2). So for every unique group we find the median of that group. `ave` returns a vector of the same length as Var1 in which each element holds its group's result, which is why it can be indexed with `is.na(df$Var1)`.
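The description above can be checked on a few numbers. This is only an illustrative sketch: the vectors `x`, `g1` and `g2` are hypothetical and are not taken from the question.

```r
# Hypothetical values, chosen only to make the grouping visible.
x  <- c(1, 3, 10, 20, 30, 40)
g1 <- c("A", "A", "A", "B", "B", "B")
g2 <- c("a", "a", "b", "a", "b", "b")

# One grouping variable: every element is replaced by the median of its g1 group.
by_g1 <- ave(x, g1, FUN = median)
print(by_g1)

# Two grouping variables: groups are the combinations of g1 and g2.
by_g1_g2 <- ave(x, g1, g2, FUN = median)
print(by_g1_g2)
```

Note that `FUN` has to be passed by name, because every unnamed argument after the first is treated as a grouping variable. Unlike `aggregate`, which returns one row per group, the result here has one value per original element.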

edited Nov 10 '16 at 11:46

answered Nov 10 '16 at 10:33

Ronak Shah


Thank you for your quick reply. If it is okay could you elaborate a little more on how the ave function is working in this case, I am relatively new to R and have not encountered it before. – Jamesm131 Nov 10 '16 at 11:05
Also, I am sorry that I didn't provide a data sample. The data set that I am working with is large and I was trying to be as general as possible as to not give unimportant information. Could you suggest a better way of asking questions for future use? – Jamesm131 Nov 10 '16 at 11:07
@Jamesm131 Hi, I have updated the answer with some description. Do let me know if anything is still not clear. Also see [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to understand how you could ask a better question. – Ronak Shah Nov 10 '16 at 11:47
