
I am having a hard time figuring out how to deal with NaN values where data imputation doesn't make sense. I am trying to do text/document clustering, and some missing values need to stay missing because there is no sensible way to fill them. My dataset contains numerical values, dates, texts, etc. DannyDannyDanny's example under the subtitle "Consider situations when imputation doesn't make sense." is a great illustration of my problem. Right after vectorization I need to perform PCA to reduce dimensionality, so I can work with big data without memory errors and cut computation time. This is where the problem starts, because none of scikit-learn's PCA algorithms can deal with NaNs (or can they?). And filling the missing values with sklearn.preprocessing.Imputer doesn't make sense because:

- Not all of them are numerical or continuous values. And in fact, some columns contain a mix of entries with and without dates!

- Some of them have to stay NaN, because otherwise they could have unwanted effects on the clustering.

And I can't simply drop columns (or rows) over just a couple of missing values; that would lose too much data. My questions are:

  1. How can I deal with NaN values without affecting the outcome of the clustering? (a sensible imputation, or something else...)
  2. Is there any PCA algorithm in Python that can deal with NaN values?

PS: Sorry for my bad English

MehmedB

2 Answers


Intuitively, if you cannot impute using any method, or imputation doesn't make sense, then you would drop those rows. The caveat is that you might not end up with many rows, depending on your data. This only works if you have an otherwise good dataset with a very small percentage of NaNs.

The other approach is to drop the columns with a very high share of NaNs; at that point they aren't very useful to the model anyway.
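A minimal sketch of both pruning strategies, using pandas on a made-up toy frame (the column names and the 50% threshold are illustrative assumptions, not part of the original post):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "date" is mostly missing on purpose.
df = pd.DataFrame({
    "length": [1.0, 2.0, np.nan, 4.0],
    "date":   [1.0, np.nan, np.nan, np.nan],
    "score":  [0.1, 0.2, 0.3, 0.4],
})

# Drop columns where more than half of the values are NaN ...
kept = df.loc[:, df.isna().mean() <= 0.5]

# ... then drop any remaining rows that still contain a NaN.
cleaned = kept.dropna(axis=0)

print(list(cleaned.columns))  # ['length', 'score']
print(len(cleaned))           # 3
```

Doing the column pass first matters: a row that looked unusable may survive once the nearly-empty columns are gone.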

The last approach you can look into is to fill those values with something extreme that isn't in the range of that column, a sentinel value like -9999 or whatever you prefer. This mostly lets the algorithm pick the value up as an outlier and not factor it into the model.
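A quick sketch of the sentinel-fill idea in plain NumPy (the -9999 value is only an assumption; it must lie far outside every feature's real range for this to work):

```python
import numpy as np

SENTINEL = -9999.0  # assumed to be outside the real range of every column

X = np.array([[1.0,    np.nan],
              [np.nan, 0.5],
              [3.0,    0.7]])

# Replace every NaN with the sentinel, leaving real values untouched.
X_filled = np.where(np.isnan(X), SENTINEL, X)

print(X_filled[1, 0])  # -9999.0
```

The same effect can be had with `sklearn.impute.SimpleImputer(strategy="constant", fill_value=-9999)` in newer scikit-learn versions, but note that distance-based clustering will treat the sentinel as a real, extreme coordinate, which is exactly the doubt raised in the comment below.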

Hope this helps!

mlenthusiast

No.

PCA means that essentially every output variable depends to some degree on every input variable, so after the projection the entire vector would become NaN. Intuitively, a missing value (that you cannot impute as 0) means there is some direction along which you can move your point arbitrarily. And because you can still move the point, you don't know its position in any of the output coordinates: it could be anywhere.
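This propagation is easy to demonstrate: project a vector with one missing feature through a dense matrix (here a random stand-in for PCA's component matrix, since the exact components don't matter for the point being made) and every output coordinate comes out NaN:

```python
import numpy as np

rng = np.random.default_rng(0)
components = rng.normal(size=(5, 2))  # stand-in for a PCA projection matrix

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])  # one missing feature
projected = x @ components

# Each output is a weighted sum over ALL inputs, so one NaN poisons both.
print(np.isnan(projected).all())  # True
```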

PCA mostly makes sense on low-dimensional continuous data. Your description of the data does not sound as if PCA is appropriate here.

Has QUIT--Anony-Mousse
  • Great explanation. I was thinking about the same thing. Silly me, I couldn't think of filling those NaN values with a unique identifier as [codingenthusiast](https://stackoverflow.com/questions/55490072/how-to-deal-with-nan-values-where-imputation-doesnt-make-sense-for-pca/55505486#55505486) suggested. But their solution still depends on the performance of the clustering model, which is why I still have some doubts about filling the NaNs. @codingenthusiast – MehmedB Apr 05 '19 at 09:20