
I know that similar questions have been asked before (e.g., 1, 2, 3), but I still do not understand why mice fails to impute some of the missing values, even when I try unconditional mean imputation as in example 1.

The sparse matrix I have is:

            k1    k3       k5       k6       k7       k8      k11      k12      k13      k14      k15
 [1,]       NA    NA       NA       NA       NA       NA       NA       NA       NA       NA 0.066667
 [2,] 0.909091    NA       NA       NA       NA 0.944723       NA       NA 0.545455       NA       NA
 [3,] 0.545455    NA       NA       NA       NA       NA       NA       NA 0.818182 0.800000 0.466667
 [4,] 0.545455    NA 0.642857       NA       NA 0.260954       NA       NA       NA       NA       NA
 [5,]       NA 0.750 0.500000       NA 0.869845       NA 0.595013       NA       NA       NA       NA
 [6,] 0.727273 0.625       NA 0.583333       NA       NA       NA 0.500000 0.545455       NA       NA
 [7,]       NA    NA 0.571429       NA       NA       NA       NA       NA       NA       NA 0.866667
 [8,] 0.545455    NA       NA       NA       NA 0.905593 0.677757       NA       NA       NA       NA
 [9,]       NA 0.999 0.714286 0.750000       NA       NA 0.881032       NA       NA 0.933333 0.733333
[10,]       NA 0.750       NA       NA       NA       NA       NA       NA 0.545455       NA       NA
[11,]       NA    NA       NA       NA       NA       NA       NA       NA 0.818182       NA       NA
[12,]       NA 0.999       NA 0.583333       NA       NA 0.986145 0.666667 0.909091       NA       NA
[13,] 0.818182    NA 0.857143 0.583333 0.001000       NA       NA       NA       NA 0.133333       NA
[14,]       NA 0.999 0.357143       NA 0.635087       NA       NA       NA       NA       NA       NA
[15,]       NA 0.750 0.857143 0.250000 0.742082 0.001000 0.001000       NA 0.636364       NA 0.533333
[16,]       NA 0.999       NA 0.250000       NA       NA       NA       NA 0.909091       NA       NA
[17,] 0.727273 0.999 0.001000       NA       NA       NA 0.886366 0.666667 0.909091 0.800000 0.933333
[18,]       NA    NA 0.571429       NA       NA 0.953382       NA 0.833333 0.727273       NA       NA
[19,]       NA    NA       NA       NA 0.661476       NA       NA 0.500000       NA 0.933333 0.600000
[20,]       NA    NA 0.857143       NA 0.661661 0.459014 0.283793       NA       NA       NA       NA
[21,]       NA    NA       NA       NA       NA       NA       NA       NA       NA       NA 0.800000
[22,] 0.454545    NA       NA       NA       NA       NA       NA 0.333333 0.727273       NA 0.533333
[23,]       NA    NA       NA 0.333333 0.790737       NA       NA       NA 0.727273 0.433333       NA
[24,]       NA 0.875       NA       NA       NA       NA       NA       NA       NA 0.999000       NA
[25,]       NA    NA 0.571429 0.583333       NA       NA 0.196147 0.500000       NA       NA       NA
[26,]       NA 0.999 0.642857 0.250000       NA       NA       NA       NA 0.636364 0.700000       NA
[27,]       NA    NA 0.714286       NA       NA       NA       NA       NA       NA       NA       NA
[28,]       NA 0.875       NA 0.500000       NA       NA       NA       NA       NA       NA 0.666667
[29,] 0.636364 0.750       NA       NA       NA 0.999000 0.999000       NA       NA       NA       NA
[30,] 0.727273    NA       NA       NA 0.916098 0.734748       NA       NA       NA 0.833333       NA
[31,]       NA    NA       NA       NA       NA       NA       NA       NA       NA       NA 0.733333
[32,]       NA 0.875       NA 0.500000       NA       NA       NA       NA 0.818182       NA       NA
[33,] 0.636364    NA       NA       NA       NA       NA 0.829819       NA 0.727273       NA 0.733333
[34,]       NA    NA 0.500000       NA       NA       NA       NA       NA       NA       NA 0.666667
[35,]       NA    NA 0.214286       NA       NA 0.529592       NA 0.001000 0.909091       NA       NA
[36,]       NA    NA       NA 0.416667 0.808369       NA       NA 0.500000 0.909091 0.633333 0.733333
[37,]       NA    NA 0.357143       NA       NA 0.837555 0.755077       NA 0.818182       NA       NA
[38,]       NA    NA       NA 0.166667 0.841643 0.364216       NA       NA       NA 0.733333       NA
[39,]       NA    NA 0.500000 0.750000       NA       NA       NA       NA 0.818182 0.999000 0.800000
[40,]       NA    NA       NA       NA 0.931836       NA       NA       NA       NA       NA 0.133333
[41,]       NA    NA 0.714286       NA       NA 0.848688       NA       NA       NA       NA       NA
[42,]       NA    NA 0.214286 0.333333 0.700812 0.208412       NA 0.333333       NA       NA       NA
[43,] 0.454545    NA       NA       NA 0.109326 0.346767 0.877241 0.833333       NA       NA       NA
[44,] 0.818182    NA 0.857143       NA       NA 0.931636       NA       NA       NA 0.733333       NA
[45,] 0.363636 0.750       NA       NA       NA       NA       NA 0.166667 0.818182       NA       NA
[46,]       NA    NA 0.785714       NA 0.738672       NA       NA       NA       NA 0.100000       NA
[47,] 0.181818    NA       NA       NA       NA       NA       NA       NA       NA       NA 0.001000
[48,]       NA    NA 0.001000 0.083333 0.308050 0.139592       NA 0.166667       NA       NA       NA
[49,]       NA    NA       NA       NA 0.561841 0.817696       NA 0.666667       NA 0.300000       NA
[50,]       NA    NA       NA 0.416667       NA       NA       NA       NA 0.545455       NA 0.866667
[51,]       NA 0.875       NA       NA 0.039781       NA       NA       NA       NA 0.933333       NA
[52,]       NA    NA 0.357143       NA       NA       NA       NA 0.333333       NA       NA       NA
[53,]       NA 0.999       NA       NA       NA 0.835015       NA       NA       NA 0.833333 0.666667
[54,]       NA 0.750       NA 0.416667       NA       NA 0.623528 0.333333 0.818182       NA       NA
[55,]       NA    NA       NA 0.666667       NA 0.878312       NA       NA       NA       NA       NA                                                      

I apply the following standard mice call:

res <- mice(Sparse_Data, maxit = 30, meth = "mean", seed = 500, print = FALSE)
t <- complete(res, action = "long", include = TRUE) # stack the original data plus all imputed data sets
out <- split(t, f = t$.imp)[-1]                     # drop the original (unimputed) data
a <- Reduce("+", out) / length(out)                 # average across the imputed data sets
data_Pred <- a[, 3:ncol(a)]                         # drop the .imp and .id columns

The predicted matrix I get is:

           k1        k3        k5        k6        k7        k8      k11       k12       k13       k14      k15
56  0.6060607 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.066667
57  0.9090910 0.8676667 0.5373542 0.4429824 0.6069598 0.9447230       NA 0.4583958 0.5454550 0.6959606       NA
58  0.5454550 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.8181820 0.8000000 0.466667
59  0.5454550 0.8676667 0.6428570 0.4429824 0.6069598 0.2609540       NA 0.4583958 0.7561986 0.6959606       NA
60  0.6060607 0.7500000 0.5000000 0.4429824 0.8698450 0.6313629 0.595013 0.4583958 0.7561986 0.6959606       NA
61  0.7272730 0.6250000 0.5373542 0.5833330 0.6069598 0.6313629       NA 0.5000000 0.5454550 0.6959606       NA
62  0.6060607 0.8676667 0.5714290 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.866667
63  0.5454550 0.8676667 0.5373542 0.4429824 0.6069598 0.9055930 0.677757 0.4583958 0.7561986 0.6959606       NA
64  0.6060607 0.9990000 0.7142860 0.7500000 0.6069598 0.6313629 0.881032 0.4583958 0.7561986 0.9333330 0.733333
65  0.6060607 0.7500000 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.5454550 0.6959606       NA
66  0.6060607 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.8181820 0.6959606       NA
67  0.6060607 0.9990000 0.5373542 0.5833330 0.6069598 0.6313629 0.986145 0.6666670 0.9090910 0.6959606       NA
68  0.8181820 0.8676667 0.8571430 0.5833330 0.0010000 0.6313629       NA 0.4583958 0.7561986 0.1333330       NA
69  0.6060607 0.9990000 0.3571430 0.4429824 0.6350870 0.6313629       NA 0.4583958 0.7561986 0.6959606       NA
70  0.6060607 0.7500000 0.8571430 0.2500000 0.7420820 0.0010000 0.001000 0.4583958 0.6363640 0.6959606 0.533333
71  0.6060607 0.9990000 0.5373542 0.2500000 0.6069598 0.6313629       NA 0.4583958 0.9090910 0.6959606       NA
72  0.7272730 0.9990000 0.0010000 0.4429824 0.6069598 0.6313629 0.886366 0.6666670 0.9090910 0.8000000 0.933333
73  0.6060607 0.8676667 0.5714290 0.4429824 0.6069598 0.9533820       NA 0.8333330 0.7272730 0.6959606       NA
74  0.6060607 0.8676667 0.5373542 0.4429824 0.6614760 0.6313629       NA 0.5000000 0.7561986 0.9333330 0.600000
75  0.6060607 0.8676667 0.8571430 0.4429824 0.6616610 0.4590140 0.283793 0.4583958 0.7561986 0.6959606       NA
76  0.6060607 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.800000
77  0.4545450 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.3333330 0.7272730 0.6959606 0.533333
78  0.6060607 0.8676667 0.5373542 0.3333330 0.7907370 0.6313629       NA 0.4583958 0.7272730 0.4333330       NA
79  0.6060607 0.8750000 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.9990000       NA
80  0.6060607 0.8676667 0.5714290 0.5833330 0.6069598 0.6313629 0.196147 0.5000000 0.7561986 0.6959606       NA
81  0.6060607 0.9990000 0.6428570 0.2500000 0.6069598 0.6313629       NA 0.4583958 0.6363640 0.7000000       NA
82  0.6060607 0.8676667 0.7142860 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606       NA
83  0.6060607 0.8750000 0.5373542 0.5000000 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.666667
84  0.6363640 0.7500000 0.5373542 0.4429824 0.6069598 0.9990000 0.999000 0.4583958 0.7561986 0.6959606       NA
85  0.7272730 0.8676667 0.5373542 0.4429824 0.9160980 0.7347480       NA 0.4583958 0.7561986 0.8333330       NA
86  0.6060607 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.733333
87  0.6060607 0.8750000 0.5373542 0.5000000 0.6069598 0.6313629       NA 0.4583958 0.8181820 0.6959606       NA
88  0.6363640 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629 0.829819 0.4583958 0.7272730 0.6959606 0.733333
89  0.6060607 0.8676667 0.5000000 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.666667
90  0.6060607 0.8676667 0.2142860 0.4429824 0.6069598 0.5295920       NA 0.0010000 0.9090910 0.6959606       NA
91  0.6060607 0.8676667 0.5373542 0.4166670 0.8083690 0.6313629       NA 0.5000000 0.9090910 0.6333330 0.733333
92  0.6060607 0.8676667 0.3571430 0.4429824 0.6069598 0.8375550 0.755077 0.4583958 0.8181820 0.6959606       NA
93  0.6060607 0.8676667 0.5373542 0.1666670 0.8416430 0.3642160       NA 0.4583958 0.7561986 0.7333330       NA
94  0.6060607 0.8676667 0.5000000 0.7500000 0.6069598 0.6313629       NA 0.4583958 0.8181820 0.9990000 0.800000
95  0.6060607 0.8676667 0.5373542 0.4429824 0.9318360 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.133333
96  0.6060607 0.8676667 0.7142860 0.4429824 0.6069598 0.8486880       NA 0.4583958 0.7561986 0.6959606       NA
97  0.6060607 0.8676667 0.2142860 0.3333330 0.7008120 0.2084120       NA 0.3333330 0.7561986 0.6959606       NA
98  0.4545450 0.8676667 0.5373542 0.4429824 0.1093260 0.3467670 0.877241 0.8333330 0.7561986 0.6959606       NA
99  0.8181820 0.8676667 0.8571430 0.4429824 0.6069598 0.9316360       NA 0.4583958 0.7561986 0.7333330       NA
100 0.3636360 0.7500000 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.1666670 0.8181820 0.6959606       NA
101 0.6060607 0.8676667 0.7857140 0.4429824 0.7386720 0.6313629       NA 0.4583958 0.7561986 0.1000000       NA
102 0.1818180 0.8676667 0.5373542 0.4429824 0.6069598 0.6313629       NA 0.4583958 0.7561986 0.6959606 0.001000
103 0.6060607 0.8676667 0.0010000 0.0833330 0.3080500 0.1395920       NA 0.1666670 0.7561986 0.6959606       NA
104 0.6060607 0.8676667 0.5373542 0.4429824 0.5618410 0.8176960       NA 0.6666670 0.7561986 0.3000000       NA
105 0.6060607 0.8676667 0.5373542 0.4166670 0.6069598 0.6313629       NA 0.4583958 0.5454550 0.6959606 0.866667
106 0.6060607 0.8750000 0.5373542 0.4429824 0.0397810 0.6313629       NA 0.4583958 0.7561986 0.9333330       NA
107 0.6060607 0.8676667 0.3571430 0.4429824 0.6069598 0.6313629       NA 0.3333330 0.7561986 0.6959606       NA
108 0.6060607 0.9990000 0.5373542 0.4429824 0.6069598 0.8350150       NA 0.4583958 0.7561986 0.8333330 0.666667
109 0.6060607 0.7500000 0.5373542 0.4166670 0.6069598 0.6313629 0.623528 0.3333330 0.8181820 0.6959606       NA
110 0.6060607 0.8676667 0.5373542 0.6666670 0.6069598 0.8783120       NA 0.4583958 0.7561986 0.6959606       NA                                  

Can someone shed some light on the problem?

  • The problem is in `res$pred`... the columns and rows for `k11` and `k15` are empty. That's why their imputed values are `NULL` and why they aren't filled in. Give me a sec to figure out why this happens :) – slamballais Mar 31 '16 at 10:28

1 Answer

You have perfectly collinear columns in your dataset. Specifically:

  • k11 and k14
  • k8 and k15

The default behavior of mice is to remove perfectly collinear columns.

Solutions

  1. Find and remove the perfectly collinear columns (e.g. with mice:::find.collinear(Sparse_Data)), as sketched below.
  2. Provide your own predictor matrix (mice(..., pred = my_prediction_matrix); pred is matched to the predictorMatrix argument).
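
For example, solution 1 could look like this (a minimal sketch, assuming Sparse_Data is the matrix shown in the question):

library(mice)

# columns that mice would flag as perfectly collinear ("k11" "k15" here)
bad <- mice:::find.collinear(Sparse_Data)
# drop them and re-run the original call
clean <- Sparse_Data[, !colnames(Sparse_Data) %in% bad]
res <- mice(clean, maxit = 30, meth = "mean", seed = 500, print = FALSE)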

Details

mice relies on its predictor matrix (stored in the result as res$predictorMatrix). This matrix determines which columns are used to predict the missing values of each variable. If the row and column for a variable contain only zeros, that variable will not be imputed, regardless of the method you specify.

You can check this matrix by running mice and then typing res$pred. As you can see, the rows and columns for k11 and k15 are all zeros, and therefore these variables aren't imputed.
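
For example (a minimal sketch; the exact layout of the output depends on your mice version):

res <- mice(Sparse_Data, maxit = 1, meth = "mean", print = FALSE)
res$pred  # the rows and columns for k11 and k15 contain only zeros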

So why does mice empty those two columns? Well, mice calls the check.data function, which in turn calls find.collinear. This function identifies which variables are collinear, and mice removes those columns in subsequent steps.

Are any of your columns collinear? Well, yes:

cor(Sparse_Data, use = "pairwise.complete.obs")
            k1            k3          k5            k6          k7           k8        k11        k12          k13         k14         k15
k1   1.0000000  1.740412e-01  0.24932705            NA  0.17164319  0.640984131  0.3053596  0.4225772 -0.536055739 -0.50460872  0.97321365
k3   0.1740412  1.000000e+00 -0.42409199 -9.370804e-05 -0.38583663  0.361416106  0.5515156  0.6567106  0.634250161 -0.70631658  0.74001342
k5   0.2493271 -4.240920e-01  1.00000000  4.471829e-01  0.02679894  0.234850334 -0.6624768  0.4201946 -0.924517670 -0.45408744 -0.78628746
k6          NA -9.370804e-05  0.44718290  1.000000e+00 -0.35377747  0.818644775  0.6824749  0.8899878  0.147657537  0.27030472  0.49159991
k7   0.1716432 -3.858366e-01  0.02679894 -3.537775e-01  1.00000000  0.207791538 -0.6406942 -0.2863018  0.898687181  0.14987951 -0.70210859
k8   0.6409841  3.614161e-01  0.23485033  8.186448e-01  0.20779154  1.000000000  0.7491736  0.5219197  0.002468839 -0.13067177  1.00000000
k11  0.3053596  5.515156e-01 -0.66247684  6.824749e-01 -0.64069422  0.749173578  1.0000000  0.5925582  0.830372468 -1.00000000  0.83452358
k12  0.4225772  6.567106e-01  0.42019459  8.899878e-01 -0.28630180  0.521919747  0.5925582  1.0000000 -0.134937885 -0.49251775  0.92582043
k13 -0.5360557  6.342502e-01 -0.92451767  1.476575e-01  0.89868718  0.002468839  0.8303725 -0.1349379  1.000000000  0.29508347  0.13853862
k14 -0.5046087 -7.063166e-01 -0.45408744  2.703047e-01  0.14987951 -0.130671767 -1.0000000 -0.4925177  0.295083470  1.00000000  0.02558161
k15  0.9732137  7.400134e-01 -0.78628746  4.915999e-01 -0.70210859  1.000000000  0.8345236  0.9258204  0.138538625  0.02558161  1.00000000

As you can see, k11 is perfectly correlated with k14 (r = -1), and k15 with k8 (r = 1). This is why they get kicked out. As expected:

mice:::find.collinear(Sparse_Data)
# [1] "k11" "k15"

Demonstration #1 (NOT a solution)

Try specifying mice(pred = diag(ncol(Sparse_Data)), ...). You'll see that it now works, as spelled out below. [Edit: for future readers: this is not a way to SOLVE the problem, just a way to show where the problem is.]
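
The demonstration call in full could look like this (a minimal sketch):

p <- diag(ncol(Sparse_Data))
dimnames(p) <- list(colnames(Sparse_Data), colnames(Sparse_Data))
res <- mice(Sparse_Data, pred = p, maxit = 30, meth = "mean",
            seed = 500, print = FALSE)
# with meth = "mean", every missing value (k11 and k15 included)
# is now filled in with its column mean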

Demonstration #2 (NOT a solution)

Try running this code before your code and you'll see that it indeed works:

# break the perfect collinearity by perturbing one value in each pair
# (bracket indexing, because Sparse_Data is a matrix; $ only works on data frames)
Sparse_Data[1, "k11"] <- 2
Sparse_Data[1, "k15"] <- 2
Sparse_Data[1, "k8"]  <- 0.5
Sparse_Data[1, "k14"] <- 0.5
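
After this change, mice:::find.collinear(Sparse_Data) should return character(0), so mice keeps all columns and your original call imputes k11 and k15 as well. Again, this deliberately corrupts the data and only demonstrates the cause; it is not a fix.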
  • Thank you A LOT for your very detailed and clear explanation!! – user3575876 Mar 31 '16 at 17:36
  • I was playing again with the result you provided and noticed that if I write `mice(pred = diag(ncol(Sparse_Data)), ...)`, the resulting `res$predictorMatrix` is a matrix of 0's. Nevertheless, the missing values are predicted. On the other hand, if I follow the instructions from http://www.inside-r.org/packages/cran/mice/docs/mice and write `predictorMatrix = (1 - diag(1, ncol(Sparse_Data)))`, the `res$predictorMatrix` indicates which variables have been used, but it has the problem of NA's that haven't been treated. Maybe you can explain it? – user3575876 Jul 13 '16 at 13:36
  • @user3575876 Solution 1 (`diag()`) gives a matrix where the diagonal elements are 1 and all other values are 0. This means that each variable is imputed from itself, i.e. each imputed value is equal to the mean of that variable. Solution 2 (`1 - diag()`) gives a matrix where the diagonal is 0 and all other values are 1, so every variable except the variable itself is used for that variable's imputation. Now, I am not sure what you mean by the last part of your comment. Are you saying that solution 1 does impute everything yet solution 2 doesn't? On this dataset? – slamballais Jul 13 '16 at 18:19
  • Yes, exactly. So solution 1 is essentially a very simplistic imputation that is based only on the non-missing values of the feature itself and does not rely on any other features of the matrix, right? – user3575876 Jul 13 '16 at 20:32
  • @user3575876 Correct. I used it as an illustration to show that the mice package does something wrong when you do not specify a prediction matrix, so something must have gone wrong in the steps where mice built the prediction matrix. In general, mean imputation (solution 1) is a bad idea. – slamballais Jul 14 '16 at 08:57