I'm struggling with imputation via the mice package to solve an NA problem in my data analysis. I'm using linear mixed models to calculate intraclass correlation coefficients (ICCs). My final data frame contains several control variables (as columns) that I use as fixed effects in the model, and some of these columns have missing values. I have no problem imputing the NAs with the following commands:

library(mice)

# "pmm" = predictive mean matching (numeric data)
imputation_list <- mice(baseline_df,
                        method = "pmm",
                        m = 5)

df_imputation_final <- complete(imputation_list)

But now my problem:

The IDs (persons in rows) are grouped into families. So I have to impute the NAs in such a way that all persons within one family receive the same imputed value.

In the following dataframe I have to make imputations.

df_test <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
                  family = c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
                  income_family = c(NA, NA, NA, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, NA, NA, NA, 200, 200, 99, 99))

So all members (IDs 1, 2, 3 and 14, 15, 16) of the families "Gerrard" and "Carragher" need imputation in the income_family variable, and the imputed value must be the same for all members of a family. The result should look like this:

df_final <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
                  family = c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
                  income_family = c(55, 55, 55, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, 66, 66, 66, 200, 200, 99, 99))

I hope you know what I mean. Thx a lot !!

Max Herre
  • *Not* a qualified statistical procedure but what you could do just to get the model working: Calculate the mean of imputed values on family level and use that. It only makes sense, though, if the imputed values tend to be similar within families. – benimwolfspelz Nov 09 '22 at 07:51
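The workaround from the comment above could look roughly like the sketch below. It assumes a person-level data frame in which `income_family` has already been imputed separately for each person; the data frame `df_person_imputed` and its values are made up for illustration and are not output from mice.

```r
library(dplyr)

# Hypothetical person-level data after ordinary imputation:
# members of the Gerrard family received different imputed values.
df_person_imputed <- data.frame(
  ID            = 1:6,
  family        = c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres"),
  income_family = c(50, 55, 60, 100, 100, 100)
)

# Replace each person's value with the mean over their family
df_family_mean <- df_person_imputed %>%
  group_by(family) %>%
  mutate(income_family = mean(income_family)) %>%
  ungroup()
```

Note that this also overwrites observed values with the family mean, which is harmless only when the observed values are already constant within each family.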

1 Answer


It's unclear what purpose the long ID variable serves if the values for income_family are the same for every observation of family. I believe the only way to achieve your desired result is to summarize your dataset before imputation.

df <- data.frame(ID=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
                      family=c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
                      income_family=c(NA, NA, NA,  100, 100, 100, 90, 90, 90, 150, 150, 40, 40, NA, NA, NA, 200, 200, 99, 99))

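library(dplyr)
library(mice)

# Collapse to one row per family; a family whose members are all NA stays NA
df2 <- df %>%
  group_by(family) %>%
  summarize(income_family = mean(income_family))

# Impute at the family level, so each family receives a single value
imputation_list <- mice(df2, m = 1, printFlag = FALSE)
df_imputation_final <- complete(imputation_list)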
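If the person-level rows are still needed afterwards, the family-level values can be joined back onto the long data. The sketch below is only an illustration: `family_imputed` is a hypothetical stand-in for the completed family-level data frame, and the values 55 and 66 are invented rather than produced by mice.

```r
library(dplyr)

# Person-level data (shortened version of the example)
df <- data.frame(
  ID            = 1:6,
  family        = c("Gerrard", "Gerrard", "Torres", "Torres", "Carragher", "Carragher"),
  income_family = c(NA, NA, 100, 100, NA, NA)
)

# Hypothetical family-level result; 55 and 66 are made-up imputed values
family_imputed <- data.frame(
  family        = c("Carragher", "Gerrard", "Torres"),
  income_family = c(66, 55, 100)
)

# Drop the incomplete column, then attach one value per family to every member
df_long <- df %>%
  select(-income_family) %>%
  left_join(family_imputed, by = "family")
```

Because the join is keyed on `family`, every member of a family ends up with the same value by construction.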

However, if you want to do proper modelling on multiply-imputed data, you will need to conduct your analyses on the mids object imputation_list, not the large dataframe df_imputation_final. If you're using lme4, see this post for details: Using imputed datasets from library mice() to fit a multi-level model in R

# Longitudinal multiple imputation
# https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_miadvanced_2018

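library(mice)

# The two-level methods in mice expect an integer cluster variable,
# so recode family as an integer ID
df_mi <- transform(df, family = as.integer(factor(family)))

imp <- mice(df_mi, maxit = 0)
meth <- imp$method
pred <- imp$predictorMatrix
meth["income_family"] <- "2lonly.pmm"   # imputes at the cluster (family) level
pred["income_family", "family"] <- -2   # -2 marks the cluster variable
# ID keeps its default code 1 (ordinary predictor); this toy data has no other covariates

imputation_list <- mice::mice(df_mi,
                              m = 5, maxit = 10,
                              method = meth,
                              seed = 123,
                              predictorMatrix = pred,
                              printFlag = FALSE)

fit <- with(data = imputation_list,
            expr = lme4::lmer(income_family ~ (1 | family)))
pool(fit)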
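To confirm that an imputation really is constant within families, one option is to stack the imputed datasets in long format and count the distinct values per family and imputation. The sketch below uses a hand-made data frame, `imp_long`, as a hypothetical stand-in for the output of `complete(imputation_list, action = "long")`; its values are invented.

```r
library(dplyr)

# Hypothetical stand-in for complete(imputation_list, action = "long"):
# .imp is the imputation number, .id the row within each imputed dataset
imp_long <- data.frame(
  .imp          = rep(1:2, each = 4),
  .id           = rep(1:4, times = 2),
  family        = rep(c(1, 1, 2, 2), times = 2),
  income_family = c(55, 55, 100, 100, 70, 70, 100, 100)
)

# Count distinct income values per family within each imputation
check <- imp_long %>%
  group_by(.imp, family) %>%
  summarize(n_values = n_distinct(income_family), .groups = "drop")

# TRUE if every family has exactly one value in every imputation
constant_within_family <- all(check$n_values == 1)
```

Values may still differ between imputations (55 versus 70 in this made-up example); that variation is what pooling accounts for.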
jrcalabrese