I am having a hard time figuring out how to deal with NaN values when data imputation doesn't make sense. I am trying to do text/document clustering, and some missing values need to stay missing because there is no sensible way to fill them. My dataset contains numerical values, dates, text, etc. DannyDannyDanny's example under the subtitle "Consider situations when imputation doesn't make sense." describes my problem well. Right after vectorization, I need to perform PCA to reduce dimensionality so I can work with big data without memory errors and with less computation time. This is where the problem starts, because none of scikit-learn's PCA algorithms can deal with NaNs (or can they?). And filling the missing values with sklearn.preprocessing.Imputer doesn't make sense because:
- Not all of the columns are numerical or continuous. In fact, some columns contain dates in some rows and not in others!
- Some values have to stay as NaN, because otherwise they could have unwanted effects on the clustering.
And I can't simply drop columns (or rows) over just a couple of missing values; that would throw away too much data. My questions are:
- How can I deal with NaN values without affecting the clustering outcome? (through a sensible imputation, or something else)
- Is there a PCA algorithm in Python that can handle NaN values?
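For reference, here is a minimal sketch of the failure I run into (a toy array stands in for my vectorized data; scikit-learn validates its input and refuses NaNs before fitting):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for my vectorized dataset, with one value left as NaN.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

try:
    PCA(n_components=2).fit(X)
    pca_error = None
except ValueError as err:  # scikit-learn's input validation rejects NaN entries
    pca_error = err

print(pca_error)  # a ValueError complaining about NaN in the input
```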
PS: Sorry for my bad English