NaN/inf values in scikit-learn manifold learning functions

Question

I have a manifold learning / non-linear dimensionality reduction problem where I know distances between objects up to some threshold, and beyond that I just know that the distance is "far". Also, in some cases some of the distances might be missing. I am trying to use sklearn.manifold to find a 1d representation. A natural approach would be to represent "far" distances as inf and missing distances as nan.

However, it seems that currently scikit-learn does not support nan and inf values in distance matrices given to manifold learning functions in sklearn.manifold, since I get ValueError: Array contains NaN or infinity.

Is there a conceptual reason for this? Some methods seem to be especially suitable for inf, e.g. non-metric MDS. Also I know that some implementations of these methods in other languages are able to handle missing/inf values.

Instead of using inf I have considered setting "far" values to a very large number, but I am not sure how this will affect the results.

Update:

I dug into the code of sklearn.manifold.MDS._smacof_single() and found a piece of code and a comment saying that "similarities with 0 are considered as missing values". Is this an undocumented way to specify missing values? Does this work with all manifold functions?

score 0 · Answer 1 · edited May 23 '17 at 10:31

Short answer: As you mentioned, non-metric MDS is capable of working with incomplete dissimilarity matrices. You are right: entries set to zero are interpreted as missing values when using MDS(metric=False). This won't work for other manifold learning procedures that are not based on non-metric MDS, though there might be similar (undocumented) approaches available.
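A minimal sketch of the zero-as-missing trick, assuming the installed scikit-learn still skips zero entries in its non-metric SMACOF code (this is undocumented behavior, so it may change). The toy data and variable names are made up for illustration, and the parameter names follow the API as it was when this was written; newer releases may rename them and emit deprecation warnings.

```python
import numpy as np
from sklearn.manifold import MDS

# Toy data (hypothetical): five points on a line, so a 1-D embedding exists.
x = np.array([0.0, 1.0, 2.5, 4.0, 7.0])
D = np.abs(x[:, None] - x[None, :])

# Mark one pair as "unknown" by zeroing both symmetric entries.
# Assumption: non-metric MDS ignores off-diagonal zeros, as the comment in
# _smacof_single() suggests; this is not documented behavior.
D_missing = D.copy()
D_missing[0, 4] = D_missing[4, 0] = 0.0

mds = MDS(
    n_components=1,
    metric=False,                 # non-metric MDS
    dissimilarity="precomputed",  # D_missing is already a distance matrix
    n_init=4,
    random_state=0,
)
embedding = mds.fit_transform(D_missing)
print(embedding.ravel())
```

Non-metric MDS only tries to preserve the rank order of the dissimilarities, so the embedding is defined up to scale, shift and reflection; compare orderings rather than raw coordinates.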

On your question about replacing inf with a very large number: this will certainly shape your low-dimensional representation. Whether it is valid is a conceptual question that can only be answered knowing the origin of the inf values. If the inf entries mean something like "these data are reeaaaalllyyyy distant from each other", replacing them with high values can make sense (as in your case). If they instead reflect missing knowledge about the dissimilarity, I would not recommend replacing them with inf. If there is no other solution (such as non-metric MDS or matrix completion), I would rather recommend replacing them with the median of the measurable distances (check out Imputation).
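The two replacements discussed above can be sketched with plain NumPy. The helper fill_distance_matrix and its far_factor parameter are hypothetical names introduced here for illustration; how large the "far" stand-in should be is a modelling choice, not something the library prescribes.

```python
import numpy as np

def fill_distance_matrix(D, far_factor=2.0):
    """Return a finite copy of D.

    inf ("far") -> far_factor * largest measured distance
    nan (missing) -> median of the measured off-diagonal distances
    far_factor is an illustrative assumption, not a recommended value.
    """
    D = np.asarray(D, dtype=float).copy()
    off_diag = ~np.eye(D.shape[0], dtype=bool)
    known = np.isfinite(D) & off_diag
    if not known.any():
        raise ValueError("no measurable distances to impute from")
    measured = D[known]
    D[np.isinf(D)] = far_factor * measured.max()
    D[np.isnan(D)] = np.median(measured)
    return D

inf, nan = np.inf, np.nan
D = np.array([
    [0.0, 1.0, 2.0, inf],
    [1.0, 0.0, nan, 3.0],
    [2.0, nan, 0.0, 1.0],
    [inf, 3.0, 1.0, 0.0],
])
D_filled = fill_distance_matrix(D)
print(D_filled)
```

The filled matrix no longer triggers the "Array contains NaN or infinity" check, but every imputed entry is treated as a real measurement by the embedding, so it is worth rerunning with a different far_factor to see how sensitive the result is.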

Check out my answer to a similar question from 2017.
