Error while imputing large dataframe using mice

Question

I have been trying to impute a data set with the mice package, using the following code:

my_imp <- mice(train, m=5, method="pmm", maxit=50)

and I got this error:

 iter imp variable
  1   1  existence.expectancy.indexError in solve.default(xtx + diag(pen)) : 
  system is computationally singular: reciprocal condition number = 3.96306e-17

Here is a sample from my dataframe (dput). The error probably results from the existence.expectancy.index column.

structure(list(galactic.year = c(990025L, 990025L, 990025L, 990025L, 
990025L), galaxy = c("Large Magellanic Cloud (LMC)", "Camelopardalis B", 
"Virgo I", "UGC 8651 (DDO 181)", "Tucana Dwarf"), existence.expectancy.index = c(0.628656922579983, 
0.818082166933375, 0.659443179243005, 0.555861648365899, 0.991196351622249
)), class = "data.frame", row.names = c(NA, -5L))

Please give me ideas on how to solve the error.

Hello and welcome to SO. Could you share a sample of your data? Without that it will be very hard to find out where the problem lies. You can use `dput()`, or `dput(head())` if the data set is large. Please help us help you. — Jan, Jun 09 '20 at 06:41
Hi, please read related Q/A: https://stackoverflow.com/a/58832614/6574038 Possible duplicate. — jay.sf, Jun 09 '20 at 06:42
@Afrikan_patriot What is different in your case that the provided error isolation approach there won't work? — jay.sf, Jun 09 '20 at 07:25
@Afrikan_patriot Thanks for updating. However, when you use `dput`, it is better not to change the output when providing it. I tried to fix that in an edit to your question. If you want to `dput` a subset of your data, use e.g. `dput(dtrain[1:30, ])`. Anyway, I tried out your code and data and wasn't able to reproduce your error. Also, the question in my last comment may still be open. — jay.sf, Jun 09 '20 at 07:45
@jay.sf I've got the solution. The problem with using mice for imputation here is the large number of unbalanced factor variables in this dataset. When these are turned into dummy variables, there is a high probability that one column will be a linear combination of others. Since the default imputation methods involve linear regression, this results in a design matrix X for which X'X cannot be inverted. One solution is to change the imputation method from the default to one that does not rely on linear regression, for example a tree-based method such as `cart`. — Afrikan_patriot, Jun 09 '20 at 10:38
@Afrikan_patriot You may post your own answer to your question. — jay.sf, Jun 09 '20 at 10:43

score 1 · Answer 1 · answered Jun 09 '20 at 10:49

The problem with using mice for imputation here is the large number of unbalanced factor variables in this dataset. When these are turned into dummy variables, there is a high probability that one column will be a linear combination of others. Since the default imputation methods involve linear regression, this results in a design matrix X for which X'X cannot be inverted.
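To make the collinearity argument concrete, here is a minimal base-R sketch. The `toy` data frame and its column names are hypothetical, chosen so that one dummy column duplicates another. It only illustrates the kind of rank deficiency described above and is not a reconstruction of what mice does internally.

```r
# Hypothetical toy data: level "c" of `region` always coincides with
# level "y" of `type`, so their dummy columns are identical.
toy <- data.frame(
  region = factor(c("a", "a", "b", "b", "c", "c")),
  type   = factor(c("x", "x", "x", "x", "y", "y"))
)

# Expand the factors into dummy columns, as a regression would.
X <- model.matrix(~ region + type, data = toy)

# Cross-product matrix that a linear-regression step needs to invert.
xtx <- crossprod(X)

n_cols  <- ncol(X)      # number of columns in the design matrix
x_rank  <- qr(X)$rank   # numerical rank; below n_cols means collinearity
rc      <- rcond(xtx)   # reciprocal condition number, near zero here

# solve() cannot invert such a matrix; capture the error message.
res <- tryCatch(solve(xtx), error = function(e) conditionMessage(e))
```

Comparing `x_rank` with `n_cols` on the dummy-expanded predictors of a real dataset is one way to check whether this kind of dependence is present.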

One solution is to change the imputation method from the default to one that does not rely on linear regression, for example a tree-based method such as `cart`.
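As a sketch of what changing the method could look like, the example below uses `method = "cart"` (classification and regression trees), one of the built-in mice methods that does not go through the linear-regression step. The choice of `cart`, the `toy` data, and the small `maxit` are assumptions for illustration only. The code assumes the mice package is installed and is skipped otherwise.

```r
# Assumes the mice package is installed; `toy` is hypothetical data.
if (requireNamespace("mice", quietly = TRUE)) {
  set.seed(1)
  toy <- data.frame(
    x = rnorm(50),
    g = factor(sample(c("a", "b", "c"), 50, replace = TRUE))
  )
  toy$y <- toy$x + rnorm(50)
  toy$y[sample(50, 10)] <- NA   # introduce missing values in y

  # Tree-based imputation instead of predictive mean matching.
  imp <- mice::mice(toy, m = 5, method = "cart", maxit = 5,
                    printFlag = FALSE)

  # Extract the first completed data set and check that y is filled in.
  completed <- mice::complete(imp, 1)
  y_has_na <- anyNA(completed$y)
}
```

Applied to the call in the question, the corresponding change would be `mice(train, m=5, method="cart", maxit=50)`. Whether that resolves the error for the full data set has not been verified here.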
