I am not an experienced coder and only started learning R in the past few weeks to help with work related to my PhD. Here is the issue:

I have been trying unsuccessfully for many, many hours to impute missing values into a data set using the missForest package in R. Below is a representative example of the problem I'm having with a fabricated data set.

The data set contains numeric values that are categorical. Upon importing I use the following code to set the class to "factor"


    data <- read.csv("~Data.csv", colClasses = c(rep('factor',3)))

    > data
    a   b   c
    1   2   3
    4   5
    7   8   9

To verify the class was set properly I run:

    missForest::varClass(data)

returns:

    [1] "factor" "factor" "factor"

I then attempt to impute and view the data but I get the original data set back with the datapoint still missing instead of having an imputed value inserted.

    data.imp <- missForest(data)
    data.imp$ximp

    a   b   c
    1   2   3
    4   5
    7   8   9

The example above shows how I import the data, convert it to factor, and attempt to impute the missing value. The example below is self-contained, reproduces the same problem, and can be pasted directly into R.

I am using R version 3.5.3 (2019-03-11)

    # install and load the missForest package
    install.packages("missForest")
    library(missForest)
    # create the test data frame with a missing value in column c
    a <- c("1","4","7")
    b <- c("2","5","8")
    c <- c("3","","9")
    data.test <- data.frame(a,b,c)
    # print the data
    data.test
    # view the class of the data to ensure it is "factor"
    missForest::varClass(data.test)
    # create the imputed data frame using missForest
    data.test.imp <- missForest(data.test)
    # print the imputed data frame
    data.test.imp$ximp

The above code returns the following, with the value in column c still missing:

    > data.test
      a b c
    1 1 2 3
    2 4 5  
    3 7 8 9
    > missForest::varClass(data.test)
    [1] "factor" "factor" "factor"
    > data.test.imp <- missForest(data.test)
      missForest iteration 1 in progress...done!
      missForest iteration 2 in progress...done!
    > data.test.imp$ximp
      a b c
    1 1 2 3
    2 4 5  
    3 7 8 9

If I convert all the data to numeric, missForest does impute values for the missing data points. The imputed values are decimals even though all my data are integers, but it works nonetheless.

The real data set I'm using is much larger but I am having the exact same issue with it.

Further, if I follow the example in the missForest manual using the iris data set, everything works as it should. But if I download the same data set from the UCI repository, manually remove a categorical data point, and run the same code, it doesn't work.

I'm sure there is something minor that I am missing but after hours of trying to figure this out I'm stuck.

cbKCnSTL
  • I don't understand what the shape of your data is. Can you please make this a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Conor Neilson Mar 29 '20 at 04:02
  • I had the formatting messed up but I believe I have it corrected now and hopefully it makes more sense – cbKCnSTL Mar 29 '20 at 04:26
  • Where is this UCI repository and how can we download this data? – Edward Mar 29 '20 at 06:17
  • @cbKCnSTL Please learn how to make a minimal, self-contained, reproducible example as described here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 – jay.sf Mar 29 '20 at 08:24
  • @jay.sf I added the reproducible code into my question. You should be able to paste it directly into R and run it and get the same output I show as well. I left my original information as well but I can delete it if that would be better. Please let me know if I need to change or add anything. I appreciate the help. – cbKCnSTL Mar 29 '20 at 15:36
  • @Edward [link](https://archive.ics.uci.edu/ml/datasets/iris) will give you the data set – cbKCnSTL Mar 29 '20 at 15:44
  • @cbKCnSTL Well done, see my answer below. – jay.sf Mar 29 '20 at 15:48

1 Answer

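This really is a minor issue. In your data.test the missing value is stored as an empty string "", which R treats as an ordinary value. missForest only imputes entries that are coded as missing (NA), so the empty strings need to be recoded first.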

You can test that with str:

    str(data.test)
    # 'data.frame': 3 obs. of  3 variables:
    # $ a: Factor w/ 3 levels "1","4","7": 1 2 3
    # $ b: Factor w/ 3 levels "2","5","8": 1 2 3
    # $ c: Factor w/ 3 levels "","3","9": 2 1 3

You see, the levels of variable c contain "", which means the empty string is coded as a category of its own rather than as a missing value.

You can easily fix that by doing

    data.test[data.test == ""] <- NA
    data.test
    #   a b    c
    # 1 1 2    3
    # 2 4 5 <NA>
    # 3 7 8    9
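
One detail worth checking after the recode: assigning NA does not remove "" from the factor's levels, it only leaves that level unused. The sketch below is a hedged variant of the fix, not part of the original answer. It adds `droplevels()` to discard the unused level, and it sets `stringsAsFactors = TRUE` explicitly on the assumption that you may be running R 4.0.0 or later, where `data.frame()` no longer converts strings to factors by default.

```r
# Rebuild the example data. stringsAsFactors = TRUE is explicit because
# R >= 4.0.0 keeps strings as character columns by default.
data.test <- data.frame(
  a = c("1", "4", "7"),
  b = c("2", "5", "8"),
  c = c("3", "", "9"),
  stringsAsFactors = TRUE
)

# Recode empty strings as NA, then drop the now-unused "" level.
data.test[data.test == ""] <- NA
data.test <- droplevels(data.test)

levels(data.test$c)   # "3" "9"
anyNA(data.test$c)    # TRUE
```

Whether dropping the unused level changes the imputation result in your case is not something this sketch demonstrates; it simply keeps the factor levels consistent with the observed categories.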

Now, missForest works:

    data.test.imp <- missForest::missForest(data.test)
    data.test.imp$ximp
    #   a b c
    # 1 1 2 3
    # 2 4 5 9
    # 3 7 8 9
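
For the data you import from a file, the same recoding can probably be done at read time instead of afterwards. The following is a minimal sketch under assumptions: the temporary file stands in for your real CSV, and its contents are made up to mirror the example above. `na.strings` is a standard argument of `read.csv()`; listing `""` there makes empty cells arrive as NA, so "" never becomes a factor level.

```r
# Write a small CSV with one empty cell to a temporary file
# (a hypothetical stand-in for the real data file).
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,", "7,8,9"), tmp)

# Read every column as a factor and treat empty cells as missing.
data <- read.csv(tmp, colClasses = "factor", na.strings = c("", "NA"))

is.na(data$c)    # FALSE TRUE FALSE
levels(data$c)   # "3" "9"
```

With the data read this way, `missForest(data)` should see a genuine NA in column c rather than an extra category.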
jay.sf