I am analysing a dataset with over 450k rows. About 100k rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column deals with workout times in minutes.
I don't think it makes sense to fill the NA values with the mean or median, given that they make up nearly a quarter of the data and the bias that could create. I would like to impute the missing observations with a linear regression instead. However, I receive an error message:
Error: vector memory exhausted (limit reached?)
In addition: There were 50 or more warnings (use warnings() to see the first 50)
This is my code:
library(mice)
# deterministic regression imputation (a single imputed dataset, m = 1)
imp_model <- mice(brfss2013, method="norm.predict", m=1)
# store data
data_imp <- complete(imp_model)
# multiple imputation
imp_model <- mice(brfss2013, m=5)
# building predictive model
fit <- with(data=imp_model, lm(y ~ x + z))
# combining results
combined <- pool(fit)
Here is a link to the data (compressed) Data
Note: I really just want to impute one column. The other columns in the data frame are a mixture of characters, integers and factors, some with more than 2 levels.
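To make the goal concrete, here is a minimal base-R sketch of deterministic regression imputation for a single column, which is roughly what I am after for pa1min_. It avoids passing the whole data frame to an imputation routine. The toy data frame and the predictor names `age` and `sex` are assumptions for illustration only; they are not taken from brfss2013, and the real predictors would need to be chosen from the actual columns.

```r
# Toy stand-in for the real data; `age` and `sex` are hypothetical predictors.
set.seed(1)
n <- 1000
df <- data.frame(
  age = sample(18:80, n, replace = TRUE),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE))
)
df$pa1min_ <- 300 - 2 * df$age + 20 * (df$sex == "Male") + rnorm(n, sd = 30)
df$pa1min_[sample(n, 250)] <- NA  # make about a quarter of the target missing

# Fit the regression using only rows where the target is observed.
# lm() drops rows with NA in the model variables by default.
fit <- lm(pa1min_ ~ age + sex, data = df)

# Fill in predictions only where the target is missing;
# observed values are copied over unchanged.
miss <- is.na(df$pa1min_)
df$pa1min_imp <- df$pa1min_
df$pa1min_imp[miss] <- predict(fit, newdata = df[miss, ])
```

Two caveats with this sketch: rows whose predictors are themselves NA would still get an NA prediction, and filling in fitted values without any noise understates the variability of the imputed column.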