
When using LabelPropagation, I often run into this warning (IMHO it should be an error, because it makes the propagation fail completely):

/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide
  self.label_distributions_ /= normalizer

So after a few tries with the RBF kernel, I discovered that the parameter gamma has an influence.

EDIT:

The problem comes from these lines:

        if self._variant == 'propagation':
            normalizer = np.sum(
                self.label_distributions_, axis=1)[:, np.newaxis]
            self.label_distributions_ /= normalizer

I don't get how label_distributions_ can be all zeros, especially when it is computed as:

self.label_distributions_ = safe_sparse_dot(
    graph_matrix, self.label_distributions_)

Gamma has an influence on graph_matrix (because graph_matrix is the result of _build_graph(), which calls the kernel function). OK. But still, something's wrong.
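Here is a minimal sketch of what I think is happening (my own toy reconstruction, not sklearn code): if the kernel underflows for every neighbour of a sample, that sample's row of graph_matrix is exactly zero, so its row of label_distributions_ becomes zero too, and the normalizer division yields 0/0:

import numpy as np

# Toy 2-sample case: row 1 of graph_matrix is all zeros after exp() underflow
graph_matrix = np.array([[1.0, 0.0],
                         [0.0, 0.0]])
label_distributions = np.array([[1.0, 0.0],    # labeled sample
                                [0.5, 0.5]])   # unlabeled sample

label_distributions = graph_matrix.dot(label_distributions)   # row 1 -> [0, 0]
normalizer = np.sum(label_distributions, axis=1)[:, np.newaxis]
print(label_distributions / normalizer)   # row 1 is 0/0 -> [nan nan], same warning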

OLD POST (before edit)

I remind you how graph weights are computed for the propagation: W = exp(-gamma * D**2), with D the pairwise distance matrix between all points of the dataset (sklearn's RBF kernel uses squared distances).

The problem is that np.exp(x) underflows to 0.0 when x is a large negative number.
Let's imagine we have two points i and j such that dist(i, j)**2 = 10.

>>> np.exp(np.asarray(-10*40, dtype=float)) # gamma = 40 => OKAY
1.9151695967140057e-174
>>> np.exp(np.asarray(-10*120, dtype=float)) # gamma = 120 => NOT OKAY
0.0

In practice, I'm not setting gamma manually but I'm using the method described in this paper (section 2.4).

So, how would one avoid this division by zero to get a proper propagation?

The only way I can think of is to normalize the dataset in every dimension, but then we lose some geometric/topological properties of the dataset (a 2x10 rectangle becoming a 1x1 square, for example).
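As a quick illustration of that distortion (my own toy example, using sklearn's MinMaxScaler to do the per-dimension normalization):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Corners of a 2x10 rectangle
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 10.0], [2.0, 10.0]])

X_scaled = MinMaxScaler().fit_transform(X)            # each dimension mapped to [0, 1]
print(X_scaled.max(axis=0) - X_scaled.min(axis=0))    # [1. 1.] -> a 1x1 square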


Reproducible example:

In this example, it's worse: even with gamma = 20 it fails (the exponent here is -gamma * d**2 = -20 * 100 = -2000, far below what float64 can represent).

In [11]: from sklearn.semi_supervised.label_propagation import LabelPropagation

In [12]: import numpy as np

In [13]: X = np.array([[0, 0], [0, 10]])

In [14]: Y = [0, -1]

In [15]: LabelPropagation(kernel='rbf', tol=0.01, gamma=20).fit(X, Y)
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:279: RuntimeWarning: invalid value encountered in true_divide
  self.label_distributions_ /= normalizer
/usr/local/lib/python3.5/dist-packages/sklearn/semi_supervised/label_propagation.py:290: ConvergenceWarning: max_iter=1000 was reached without convergence.
  category=ConvergenceWarning
Out[15]: 
LabelPropagation(alpha=None, gamma=20, kernel='rbf', max_iter=1000, n_jobs=1,
         n_neighbors=7, tol=0.01)

In [16]: LabelPropagation(kernel='rbf', tol=0.01, gamma=2).fit(X, Y)
Out[16]: 
LabelPropagation(alpha=None, gamma=2, kernel='rbf', max_iter=1000, n_jobs=1,
         n_neighbors=7, tol=0.01)

  • Maybe code the RBF kernel with a threshold to avoid returning 0. – Frayal Aug 28 '18 at 12:33
  • But if I do such a thing, a point at d=15 will have less influence than two points at d=100 (something like 1e-99 against 2*1e-99). – politinsa Aug 28 '18 at 12:53
  • Normally I would say lower the gamma, but that won't help since you find it with an algorithm. People tend to take gamma = 1/n_features. 20 is really huge for a gamma; I'm not sure you'll ever bump into this while doing your LabelPropagation. With a gamma < 1, I think this would not be a problem... can't help you more, sorry. – Frayal Aug 28 '18 at 13:10
  • 1/n_features clearly doesn't work for datasets like moons, e.g. `sklearn.datasets.make_moons(1000, noise=0.06)`. – politinsa Aug 28 '18 at 13:15
  • I think that values larger than 10 are awfully big, but that depends on your data of course. As for `np.exp(x)` being zero for large negative x, try this: `np.exp(np.asarray(-10*120, dtype=np.float128))`. – Qusai Alothman Aug 31 '18 at 12:00

1 Answer


Basically you're doing a softmax function, right?

The general way to prevent softmax from overflowing or underflowing is (from here):

# Instead of this . . . 
def softmax(x, axis = 0):
    return np.exp(x) / np.sum(np.exp(x), axis = axis, keepdims = True)

# Do this
def softmax(x, axis = 0):
    e_x = np.exp(x - np.max(x, axis = axis, keepdims = True))
    return e_x / e_x.sum(axis, keepdims = True)

This bounds e_x between 0 and 1, and assures that one value of e_x will always be 1 (namely the element at np.argmax(x)). This prevents both overflow and underflow (which occur when np.exp(x.max()) is bigger or smaller than float64 can handle).
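A quick sanity check with exponent magnitudes like the ones in the question (toy values of my own):

import numpy as np

x = np.array([-2000.0, -1990.0])   # e.g. -gamma * d**2 with gamma = 20, d**2 = 100

# Naive version: both exponentials underflow to 0.0, giving 0/0 -> nan
print(np.exp(x) / np.sum(np.exp(x)))   # [nan nan] plus the true_divide warning

# Stable version: shift by the max before exponentiating
e_x = np.exp(x - np.max(x))            # [4.54e-05, 1.0]
print(e_x / e_x.sum())                 # [~4.54e-05, ~9.9995e-01]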

In this case, as you can't change the algorithm, I would take the input D and make D_ = D - D.min(). This should be numerically equivalent to the above, since the maximum exponent of W = exp(-gamma * D) is -gamma * D.min() (flipping the sign turns the minimum into the maximum). Then do your algorithm with regard to D_.
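To make that concrete (my own sanity check, not part of the original answer): subtracting D.min() rescales every weight by the constant factor exp(gamma * D.min()), so the largest weight becomes exactly 1 and a row of weights can no longer be all zeros:

import numpy as np

gamma = 120
D = np.array([10.0, 12.0, 18.0])   # toy pairwise distances (no self-distances)

print(np.exp(-gamma * D))          # [0. 0. 0.] -- everything underflows

D_ = D - D.min()                   # [0., 2., 8.]
W = np.exp(-gamma * D_)            # [1.0, ~5.9e-105, 0.0]
print(W)                           # distant points may still underflow, but the
                                   # closest one contributes exactly 1, so the
                                   # normalizer can never be zero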

EDIT:

As recommended by @PaulBrodersen below, you can build a "safe" RBF kernel based on the sklearn implementation here:

import numpy as np
import sklearn.metrics

def rbf_kernel_safe(X, Y=None, gamma=None):
    X, Y = sklearn.metrics.pairwise.check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1.0 / X.shape[1]

    K = sklearn.metrics.pairwise.euclidean_distances(X, Y, squared=True)
    K *= -gamma
    K -= K.max()    # shift exponents so the largest is 0 (max weight becomes 1)
    np.exp(K, K)    # exponentiate K in-place
    return K

And then use it in your propagation:

LabelPropagation(kernel=rbf_kernel_safe, tol=0.01, gamma=20).fit(X, Y)

Unfortunately I only have v0.18, which doesn't accept user-defined kernel functions for LabelPropagation, so I can't test it.
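One more caveat, from my reading of the sklearn source (an untested assumption on my part): when kernel is a callable, _get_kernel seems to invoke it simply as kernel(X, y), so the gamma=20 passed to the LabelPropagation constructor may never reach rbf_kernel_safe. You can bind it explicitly, e.g.:

from functools import partial

# Bind gamma to the callable so it doesn't fall back to its 1/n_features default
LabelPropagation(kernel=partial(rbf_kernel_safe, gamma=20), tol=0.01).fit(X, Y)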

EDIT2:

Checking your source for why you have such large gamma values makes me wonder whether you are using gamma = D.min()/3, which would be incorrect. The definition is sigma = D.min()/3, while the definition of sigma in w is

w = exp(-d**2/sigma**2)  # Equation (1)

which would make the correct gamma value 1/sigma**2, or 9/D.min()**2.
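As a worked example (toy numbers of my own, not from the paper):

D_min = 0.5               # smallest non-zero pairwise distance (toy value)
sigma = D_min / 3         # the paper's heuristic
gamma = 1 / sigma**2      # = 9 / D_min**2
print(gamma)              # 36.0 -- easily large enough to trigger the underflow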

    Why should you not change the *implementation*? You could inherit from `LabelPropagation`, and overwrite `_get_kernel` to use your custom `rbf_kernel`, in which you add a line `K -= K.max()` after [line 843](https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/metrics/pairwise.py#L843) in the sklearn implementation. Disclaimer: haven't tested any of this. – Paul Brodersen Sep 04 '18 at 16:19
  • Thanks for the edit, I was thinking of writing my own kernel. And don't worry, I used gamma = 9/D.min()**2, which is big if D.min() < 1. – politinsa Sep 05 '18 at 20:58
  • Honestly the heuristic for `gamma` in that paper seems to be more of a WAG that worked for a given data set. `gamma >> 1` is very quickly going to get you a binary `label_distributions` matrix (with only 1s and 0s), and if any row gets driven to all 0s (because one point is far from any others) your method will fail as you've described. – Daniel F Sep 06 '18 at 05:58
  • Today I had time to try it, and the modified kernel allows me to work with bigger gammas than the classic RBF kernel. Thanks! – politinsa Sep 06 '18 at 10:17