
I'm using scikit-learn's MDS to perform dimensionality reduction on some data. I would like to check the stress value to assess the quality of the reduction. I was expecting something between 0 and 1. However, I got values outside this range. Here's a minimal example:

%matplotlib inline

from sklearn.preprocessing import normalize
from sklearn import manifold
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D

import numpy


def similarity_measure(vec1, vec2):
    vec1_x = numpy.arctan2(vec1[1], vec1[0])
    vec2_x = numpy.arctan2(vec2[1], vec2[0])
    vec1_y = numpy.sqrt(numpy.sum(vec1[0] * vec1[0] + vec1[1] * vec1[1]))
    vec2_y = numpy.sqrt(numpy.sum(vec2[0] * vec2[0] + vec2[1] * vec2[1]))

    dot  = numpy.sum(vec1_x * vec2_x + vec1_y * vec2_y)
    mag1 = numpy.sqrt(numpy.sum(vec1_x * vec1_x + vec1_y * vec1_y))
    mag2 = numpy.sqrt(numpy.sum(vec2_x * vec2_x + vec2_y * vec2_y))
    return dot / (mag1 * mag2)

plt.figure(figsize=(15, 15))

delta = numpy.zeros((100, 100))
data_x = numpy.random.randint(0, 100, (100, 100))
data_y = numpy.random.randint(0, 100, (100, 100))

for j in range(100):
    for k in range(100):
        if j <= k:
            dist = similarity_measure((data_x[j].flatten(), data_y[j].flatten()), (data_x[k].flatten(), data_y[k].flatten()))
            delta[j, k] = delta[k, j] = dist

# map similarities in [-1, 1] to dissimilarities in [0, 1]
delta = 1 - ((delta + 1) / 2)
delta /= numpy.max(delta)

mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, random_state=0,
                   dissimilarity="precomputed", n_jobs=1)
coords = mds.fit(delta).embedding_
print(mds.stress_)

plt.scatter(coords[:, 0], coords[:, 1], marker='x', s=50, edgecolor='None')
plt.tight_layout()

Which, in my test, printed the following:

263.412196461

And produced this image:

[scatter plot of the resulting 2-D MDS embedding]

How can I analyze this value, without knowing the maximum value? Or how to normalize it, to have it between 0 and 1?

Thank you.

pceccon
  • I have the same problem, did you figure out the answer? Here it is also stated that it should be between 0 and 1 http://www.analytictech.com/borgatti/mds.htm – student Aug 19 '16 at 14:41
  • Hi, @student. Yes, I did solve this. As far as I remember, the "stress" function of this method is not normalised. You have to include the denominator (https://en.wikipedia.org/wiki/Multidimensional_scaling) to have it between 0 and 1. – pceccon Oct 03 '16 at 20:20
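To make the normalization mentioned in the comment above concrete, here is a minimal sketch of dividing the raw stress by the denominator from the Wikipedia formula to obtain Kruskal's Stress-1. The 3×3 dissimilarity matrix and the raw stress value below are made-up toy values for illustration, not the data from the question:

```python
import numpy as np

# Toy symmetric dissimilarity matrix (stand-in for `delta` above)
delta = np.array([[0.0, 0.5, 0.8],
                  [0.5, 0.0, 0.3],
                  [0.8, 0.3, 0.0]])

# Stand-in for the raw stress returned by mds.stress_
raw_stress = 0.02

# Stress-1 = sqrt(raw_stress / (0.5 * sum of squared dissimilarities));
# the 0.5 accounts for each pair (i, j) appearing twice in the full matrix.
stress1 = np.sqrt(raw_stress / (0.5 * np.sum(delta ** 2)))
print(stress1)  # ≈ 0.1429, "excellent" on Kruskal's scale
```

With the toy values this gives sqrt(0.02 / 0.98) = 1/7, which lands in the "excellent" band of Kruskal's interpretation.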

2 Answers


It is because scikit-learn's current implementation computes and returns the raw Stress value (σr), while you are expecting Stress-1 (σ1).

The former is not very informative (a high value does not necessarily indicate a bad fit), and a better way of communicating reliability is to calculate a normed stress, e.g. Stress-1, which according to Kruskal (1964, p. 3) has more or less the following interpretation: a value of 0 indicates a perfect fit, 0.025 excellent, 0.05 good, 0.1 fair, and 0.2 poor.

I just implemented the calculation of Stress-1 and sent a PR. In the meantime, one can use the version from this branch, where Stress-1 is computed and returned instead of raw Stress when the normalize parameter is set to True (False by default).

For more information, cf. Kruskal (1964, pp. 8-9) or Borg and Groenen (2005, pp. 41-43).


While also searching for Kruskal's stress, I found this French course by Ricco Rakotomalala. It contains example code that seems to calculate Kruskal's stress correctly:

import pandas
import numpy
from sklearn import manifold
from sklearn.metrics import euclidean_distances

## Input data format (file.csv) : dissimilarity matrix
#   ;  A  ;  B  ;  C  ;  D  ; E
# A ; 0   ; 0.9 ; 0.8 ; 0.5 ; 0.8
# B ; 0.9 ; 0   ; 0.7 ; 0   ; 1
# C ; 0.8 ; 0.7 ; 0   ; 0.2 ; 0.4
# D ; 0.5 ; 0   ; 0.2 ; 0   ; 0.8
# E ; 0.8 ; 1   ; 0.4 ; 0.8 ; 0


## Load data
data = pandas.read_table("file.csv", sep=";", header=0, index_col=0)

## MDS
mds = manifold.MDS(n_components=2, random_state=1, dissimilarity="precomputed")
mds.fit(data)
# Coordinates of points in the plan (n_components=2)
points = mds.embedding_

## sklearn Stress
print("sklearn stress :")
print(mds.stress_)
print("")

## Manual calculus of sklearn stress
DE = euclidean_distances(points)
stress = 0.5 * numpy.sum((DE - data.values)**2)
print("Manual calculus of sklearn stress :")
print(stress)
print("")

## Kruskal's stress (or stress formula 1)
stress1 = numpy.sqrt(stress / (0.5 * numpy.sum(data.values**2)))
print("Kruskal's Stress :")
print("[Poor > 0.2 > Fair > 0.1 > Good > 0.05 > Excellent > 0.025 > Perfect > 0.0]")
print(stress1)
print("")

Metalman
  • Thanks. This was quite helpful – Regi Mathew Nov 19 '20 at 14:10
  • hello, can I ask why the 0.5 is included? It's not in the link provided by @Lukasz Borchmann – Tim Kirkwood Nov 11 '21 at 13:05
  • I just copy/pasted the code from a course, but on pages 5 and 6 of the course you can see that the sum is equal to half of another sum where j is never equal to i. I believe it's because the original calculation works on the upper triangle of values, while the "simple" calculation uses the whole matrix. Therefore, the final result must be divided by 2. – Metalman Nov 12 '21 at 16:13
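The factor of 0.5 discussed in the comments can be sanity-checked numerically: for a symmetric dissimilarity matrix with a zero diagonal, half the sum over the full matrix equals the sum over the strict upper triangle used in the original formula. A quick sketch with a toy matrix (not the one from the answer):

```python
import numpy as np

# Toy symmetric dissimilarity matrix with zero diagonal
d = np.array([[0.0, 0.9, 0.8],
              [0.9, 0.0, 0.7],
              [0.8, 0.7, 0.0]])

# Each pair (i, j) appears twice in the full matrix, so half the full
# sum of squares equals the sum over the strict upper triangle.
full_half = 0.5 * np.sum(d ** 2)
upper = np.sum(np.triu(d, k=1) ** 2)
print(full_half, upper)  # both 1.94
```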