As a new R user, I'm having trouble understanding why the NA values in my dataframe keep changing. I'm running my code on Kaggle; maybe that's where the problem comes from?

Original dataframe titled "abc"

Multiple columns have NA values, so I decided to use multiple imputation to handle them.

So I created a new dataframe containing only the columns that had NA values and began the imputation. This is the new dataframe, titled "abc1":

abc1 <- select(abc, c(9,10,15,16,17,18,19,25,26))

#mice imputation
input_data = abc1

my_imp = mice(input_data, m=5, method="pmm", maxit=20)

summary(input_data$m_0_9)
my_imp$imp$m_0_9

When the imputation runs, it creates 5 columns of candidate values to fill in for the NA values of column m_0_9, and I choose which column to use.

Imputation of column 'm_0_9'

Then I run this code:

final_clean_abc1 <- complete(my_imp,5)

This assigns the values from column 5 of the last image to the NA values in my "abc1" dataframe and saves the result as "final_clean_abc1".
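To check which imputed column ends up in the completed data, here is a minimal sketch on the `nhanes` example data that ships with mice (not the Kaggle data). The names `imp`, `filled` and `missing_rows` are made up for illustration, and the seed value is arbitrary.

```r
library(mice)

# nhanes is a small example dataset bundled with mice; it stands in for abc1 here
imp <- mice(nhanes, m = 5, method = "pmm", maxit = 5, seed = 1, printFlag = FALSE)

# Rows where bmi was missing in the original data
missing_rows <- which(is.na(nhanes$bmi))

# Completed dataset built from the 5th imputation
filled <- complete(imp, 5)

# The values filled in for bmi should match column 5 of imp$imp$bmi
all.equal(filled$bmi[missing_rows], imp$imp$bmi[, 5])

# All five completed datasets stacked, with .imp marking the imputation number
long <- complete(imp, action = "long")
table(long$.imp)
```

The last two lines show that the other four imputations are still available and differ only in the rows that were originally missing.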

Lastly I replace the columns from the original "abc" dataframe that had missing values with the new columns in "final_clean_abc1."

I know this probably isn't the cleanest:

abc$m_0_9 <- final_clean_abc1$m_0_9
abc$m_10_12 <- final_clean_abc1$m_10_12
abc$f_0_9 <- final_clean_abc1$f_0_9
abc$f_10_12 <- final_clean_abc1$f_10_12
abc$f_13_14 <- final_clean_abc1$f_13_14
abc$f_15 <- final_clean_abc1$f_15
abc$f_16 <- final_clean_abc1$f_16
abc$asian_pacific_islander <- final_clean_abc1$asian_pacific_islander
abc$american_indian <- final_clean_abc1$american_indian
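The same replacement can probably be written in one assignment by indexing with the column names. The sketch below uses small made-up data frames in place of the real `abc` and `final_clean_abc1`, and assumes both have the same number of rows in the same order.

```r
# Toy stand-ins for the real data (values are invented for illustration)
abc <- data.frame(
  id    = 1:4,
  m_0_9 = c(10, NA, 30, NA),
  f_0_9 = c(NA, 2, 3, 4)
)
final_clean_abc1 <- data.frame(
  m_0_9 = c(10, 20, 30, 40),
  f_0_9 = c(1, 2, 3, 4)
)

# Overwrite every column of abc that also exists in the imputed data frame
imputed_cols <- names(final_clean_abc1)
abc[imputed_cols] <- final_clean_abc1

# Confirm that no missing values remain
anyNA(abc)
```

Because the columns are matched by name, this keeps working if the list of imputed columns changes.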

Now that I have a dataframe 'abc' with no missing values, this is where my problem arises. I should be seeing '162' in row 10 of the m_0_9 column, but when I save my code and view it on Kaggle I get the value '7' for that row and column, as shown in the photo below.

"abc" dataframe with no NA values

Hopefully this makes sense; I tried to be as specific as I could.

  • 1
    Welcome to SO! Please make [your example reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) using `dput(your_data)` or `dput(head(your_data))` instead of screen captures, which cannot be used by others – MonJeanJean Dec 21 '21 at 06:54
  • First, I hope you know that the way you use mice is not canonical: Multiple imputation means that you use several imputations simultaneously instead of picking one imputation only. The idea is that the variance among the several imputations represents your uncertainty about the missing value. The way you do it, you act as if you actually knew the missing value. Anyway, about your question: I don't know what Kaggle does, but maybe it re-runs your code upon saving/viewing, changing the random numbers used by `mice`. Try setting a seed just before using `mice()`, like this: `set.seed(123)` – benimwolfspelz Dec 21 '21 at 08:00

2 Answers

There are multiple stochastic processes in mice, which generate several imputed values for each missing value; the analyses run on the resulting datasets are then pooled. Unless the random seed is fixed, you should not expect the same result each time you run mice.

From the MICE documentation

In the first step, the dataset with missing values (i.e. the incomplete dataset) is copied several times. Then in the next step, the missing values are replaced with imputed values in each copy of the dataset. In each copy, slightly different values are imputed due to random variation. This results in multiple imputed datasets. In the third step, the imputed datasets are each analyzed and the study results are then pooled into the final study result. In this Chapter, the first phase in multiple imputation, the imputation step, is the main topic. In the next Chapter, the analysis and pooling phases are discussed.

https://bookdown.org/mwheymans/bookmi/multiple-imputation.html
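The three steps in the quoted passage (impute, analyze, pool) could be sketched as follows. It uses the `nhanes` example data bundled with mice. The model `chl ~ age + bmi` and the seed value are arbitrary choices for illustration, not a recommendation for the asker's data.

```r
library(mice)

# Step 1: create m = 5 imputed copies of the incomplete data
imp <- mice(nhanes, m = 5, method = "pmm", maxit = 5, seed = 123, printFlag = FALSE)

# Step 2: fit the same model on each imputed dataset
fit <- with(imp, lm(chl ~ age + bmi))

# Step 3: pool the five sets of estimates into one result
pooled <- pool(fit)
summary(pooled)
```

The pooled standard errors include the variation between the five imputations, which is the uncertainty that is lost when only one completed dataset is kept.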

brucezepplin

We have a wonderful series of vignettes that detail the use of mice. Part of this series covers the stochastic nature of the algorithm and how to fix it. Calling `mice(yourdata, seed = 123)` generates the same set of multiple imputations every time.
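As a quick check of the effect of the `seed` argument, here is a small sketch on the bundled `nhanes` data. The object names and seed values are arbitrary.

```r
library(mice)

# Two runs with the same seed
imp_a <- mice(nhanes, m = 5, method = "pmm", maxit = 5, seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, m = 5, method = "pmm", maxit = 5, seed = 123, printFlag = FALSE)

# Same seed: the completed datasets should be identical
identical(complete(imp_a, 5), complete(imp_b, 5))

# A run with a different seed, for comparison; its imputed values will generally differ
imp_c <- mice(nhanes, m = 5, method = "pmm", maxit = 5, seed = 456, printFlag = FALSE)
head(complete(imp_c, 5))
```

On a platform that re-runs the notebook when it is saved, a fixed seed should keep values such as the one in row 10 of `m_0_9` stable between runs.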

Gerko Vink