I have a dataset with 69 columns and over 50000 rows which is structured like this:

  • Some of the columns can only take the values 0 or 1 (binary), for example 'isFemale', 'isChild', etc.

  • Some other columns are also binary (0 or 1) but are mutually exclusive. For example, I have 3 columns called 'Primary.Language.ENGLISH', 'Primary.Language.SPANISH', 'Primary.Language.OTHER'. These columns are exclusive, so only one of them can be 1 in any given row.

Valid rows look like this:

Primary.Language.ENGLISH    Primary.Language.SPANISH    Primary.Language.OTHER  
1                           0                           0       
0                           1                           0

A row like this is not allowed (there cannot be more than one 1 in the same row):

Primary.Language.ENGLISH    Primary.Language.SPANISH    Primary.Language.OTHER    
1                           1                           0       

Both types of columns have NAs (about 4-5%), and I was thinking of imputing them with the mice package in R. However, I am afraid that for the second type the imputation might not respect the constraint described above (at most one '1' per row within each group of exclusive columns). Do you have any suggestions on how I could achieve this?

  • There is probably a way to do this in mice, but a quick fix is to construct a factor variable for each group (see https://stackoverflow.com/questions/29227111/convert-multiple-binary-columns-to-single-categorical-column), so that the information is encoded in one variable, and then apply mice – user20650 Apr 22 '19 at 09:46

1 Answer

I don't think there is a built-in parameter in mice to achieve this.

What you can do is transform each group of exclusive binary columns into a single numeric variable (e.g. a variable Primary.Language with 1 for ENGLISH, 2 for SPANISH, 3 for OTHER).
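As a minimal sketch of that recoding in base R: the column names come from the question, while the toy values, the helper name `lang_cols`, and the choice to treat a row as missing whenever any of its three columns is NA are assumptions made for illustration.

```r
# Toy data: column names follow the question, the values are made up.
df <- data.frame(
  Primary.Language.ENGLISH = c(1, 0, 0, NA),
  Primary.Language.SPANISH = c(0, 1, 0, NA),
  Primary.Language.OTHER   = c(0, 0, 1, NA)
)

# Hypothetical helper: the group of mutually exclusive columns, in code order.
lang_cols <- c("Primary.Language.ENGLISH",
               "Primary.Language.SPANISH",
               "Primary.Language.OTHER")

# Weight the columns by 1, 2, 3 and sum: a valid row has a single 1,
# so the sum is the position of that 1. Any NA in a row propagates to NA.
m <- as.matrix(df[lang_cols])
df$Primary.Language <- as.vector(m %*% seq_along(lang_cols))

# Drop the original dummy columns so only the single coded variable is imputed.
df[lang_cols] <- NULL
```

A row that already violates the constraint (all zeros, or more than one 1) would get a code that is 0 or ambiguous, so it may be worth checking the observed rows before recoding.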

If you use PMM (predictive mean matching) as the imputation algorithm, selected via the method parameter, your constraint will be respected.

Imputations with PMM are based on values observed elsewhere in the data, so imputations outside the observed values will not occur. You won't get a 4 or 5 (or a non-integer such as 1.5) as an imputed value for the new variable.
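A small sketch of what this could look like with mice, on simulated data. The data frame `toy`, its sizes, and the seeds are invented for illustration; `mice()`, its `method`, `m`, `seed` and `printFlag` arguments, and `complete()` are part of the mice package.

```r
library(mice)

# Simulated stand-in for the real data (values are made up).
set.seed(1)
n <- 200
toy <- data.frame(
  isFemale         = rbinom(n, 1, 0.5),
  isChild          = rbinom(n, 1, 0.3),
  Primary.Language = sample(1:3, n, replace = TRUE)
)
toy$Primary.Language[sample(n, 10)] <- NA  # roughly 5% missing

# Predictive mean matching for every incomplete column.
imp <- mice(toy, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# First of the five completed data sets.
completed <- mice::complete(imp, 1)

# PMM copies values from observed donors, so only 1, 2 or 3 can appear.
table(completed$Primary.Language)
```

Note that the codes 1 to 3 are an arbitrary ordering, and PMM treats them as numbers. If that is a concern, storing the variable as a factor, as suggested in the comment on the question, and using a categorical method such as "polyreg" may be worth considering.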

After the imputation you can transform the variable back to the binary format if you need it.
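The back-transformation can be sketched in base R as follows. The data frame `completed` here is a made-up stand-in for one completed data set, and `levels_lang` is a hypothetical helper that assumes the coding 1 = ENGLISH, 2 = SPANISH, 3 = OTHER described above.

```r
# Stand-in for a completed data set (values are made up).
completed <- data.frame(Primary.Language = c(1, 3, 2, 2))

# Hypothetical helper: level names in the same order as the numeric codes.
levels_lang <- c("ENGLISH", "SPANISH", "OTHER")

# One 0/1 column per level; exactly one of them is 1 in each row.
for (k in seq_along(levels_lang)) {
  col <- paste0("Primary.Language.", levels_lang[k])
  completed[[col]] <- as.integer(completed$Primary.Language == k)
}

# Remove the temporary coded variable.
completed$Primary.Language <- NULL
```

Because every row carries exactly one code, the rebuilt columns satisfy the exclusivity constraint by construction.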

Steffen Moritz