
I hope this is the right place to post; if not, I'm happy to move the question to SO.

In any case, I am using MDS to find a 2-D representation of a dataset. The data are pKa values of amino acid residues across many years' worth of protein data; at its core, these are decimal numbers on the same scale. There are many positions (~600 rows) and many years (~12 columns).

My question is this: is the correct input to MDS the data matrix (positions vs years), or can I pass in the correlation matrix (year vs year)? I ask because the API docs conflict with the written description.

API docs say data matrix: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html#sklearn.manifold.MDS (i.e. n_samples, n_features).

Written description says "the input similarity matrix": http://scikit-learn.org/stable/modules/manifold.html

ericmjl

1 Answer


If you pass dissimilarity='euclidean' to the estimator's constructor (this is also the default), it takes a data matrix of shape (n_samples, n_features) and computes the Euclidean distance matrix between the rows for you.

If you pass dissimilarity='precomputed', it takes a dissimilarity matrix.
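
As a rough illustration of the two modes, here is a minimal sketch. The random array is a hypothetical stand-in for the real pKa data (fewer rows than the ~600 in the question, to keep it quick), and it assumes a scikit-learn version in which `MDS` accepts the `dissimilarity` argument as described above; `n_init=1` and `random_state=0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical stand-in data: 50 positions (rows) x 12 years (columns).
rng = np.random.default_rng(0)
X = rng.normal(loc=7.0, scale=1.0, size=(50, 12))

# Mode 1: pass the data matrix; MDS computes Euclidean distances between rows itself.
mds_raw = MDS(n_components=2, dissimilarity="euclidean", n_init=1, random_state=0)
emb_raw = mds_raw.fit_transform(X)   # one 2-D point per row of X

# Mode 2: compute a square dissimilarity matrix yourself and pass that instead.
D = euclidean_distances(X)           # D[i, j] is the distance between rows i and j
mds_pre = MDS(n_components=2, dissimilarity="precomputed", n_init=1, random_state=0)
emb_pre = mds_pre.fit_transform(D)   # one 2-D point per row of D

print(emb_raw.shape, emb_pre.shape)
```

To embed the years rather than the positions, the same calls would be made on `X.T`. Note that a correlation matrix is a similarity, not a dissimilarity, so it would need converting before being passed as 'precomputed'; `1 - r` is one common choice, though that is an assumption here and not something the docs prescribe.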

The docs are indeed not super-clear on this, though; I'm sure a pull request adding a brief note to the description of the X argument, and clarifying that 'euclidean' is the default (I had to check the source), would be accepted.

Danica
  • Many thanks, @Dougal! I still need to wait 6 more mins. to accept your answer. :-) – ericmjl Aug 07 '14 at 21:10
  • What should the entry `(i,j)` be in the Euclidean distance matrix computed from the data matrix (for example, one with 7 rows and 3 columns)? – Sigur Aug 30 '17 at 23:26
  • @Sigur If you have a data matrix of shape `(7, 3)`, that means in scikit-learn that you have 7 input points, with 3-dimensional features. If you're using `dissimilarity='precomputed'`, `dissim[i, j]` should be the dissimilarity between the `i`th and the `j`th input points, e.g. `np.linalg.norm(X[i] - X[j])`. Note that [`sklearn.metrics.pairwise.euclidean_distances`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances) will compute Euclidean distances for you. – Danica Aug 30 '17 at 23:29
  • So each row is read as a point, and the dimension comes from the number of columns! If I transpose, the situation is completely different. Thanks so much. – Sigur Aug 30 '17 at 23:40
  • @Sigur Yeah, that's the standard in scikit-learn: see e.g. "shape of the data arrays" in [this section of the tutorial](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset). – Danica Aug 30 '17 at 23:42