
I am working with large datasets of protein-protein similarities generated in NCBI BLAST. I have stored the results in large pairwise matrices (25,000 x 25,000) and I am using multidimensional scaling (MDS) to visualize the data. These matrices were too large to fit in RAM, so I stored them on disk in HDF5 format and accessed them with the h5py module.
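For reference, the on-disk setup described above can be sketched roughly like this. The file name, dataset name, and the toy matrix size are illustrative (a 25,000 x 25,000 float matrix would be written the same way, just block by block):

```python
# Sketch: storing a large pairwise matrix on disk with h5py and reading it
# back in pieces. All names and sizes here are illustrative.
import numpy as np
import h5py

n = 100  # stand-in for the 25,000 proteins in the real dataset

with h5py.File("dissim_demo.h5", "w") as f:
    # Chunked storage lets h5py read row blocks without loading the whole matrix.
    dset = f.create_dataset("dissimilarity", shape=(n, n),
                            dtype="f4", chunks=(10, n))
    rng = np.random.default_rng(0)
    for i in range(0, n, 10):
        # Write the matrix in row blocks so only one block is in RAM at a time.
        dset[i:i + 10, :] = rng.random((10, n), dtype="f4")

with h5py.File("dissim_demo.h5", "r") as f:
    row = f["dissimilarity"][0, :]  # reads a single row, not the full matrix

print(row.shape)
```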

The sklearn manifold MDS method generated great visualizations for small-scale data in 3D, so that is the one I am currently using. For the calculation, it requires a complete symmetric pairwise dissimilarity matrix. However, with large datasets, a sort of "crust" forms that obscures the clusters.
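The call pattern being used is presumably something like the following sketch, with a precomputed dissimilarity matrix. The toy data and parameters here are illustrative, not the actual protein data:

```python
# Sketch: sklearn MDS on a precomputed, complete, symmetric dissimilarity
# matrix. Toy-sized input; the 25,000 x 25,000 case is far slower.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.random((20, 5))
# Full symmetric dissimilarity matrix from pairwise Euclidean distances.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)  # 3D coordinates, one row per item
print(coords.shape)
```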

http://imgur.com/XkpoOJ4

I think the problem is that I am required to input a complete dissimilarity matrix. Some proteins are not related to each other, but in the pairwise dissimilarity matrix, I am forced to input a default max value of dissimilarity. In the documentation of sklearn MDS, it says that a value of 0 is considered a missing value, but inputting 0 where I want missing values does not seem to work.

Is there any way of inputting an incomplete dissimilarity matrix so unrelated proteins don't have to be inputted? Or is there a better/faster way to visualize the data in a pairwise dissimilarity matrix?

datadude
  • Why do you think using missing values doesn't work? The MDS implementation is done with missing values in mind. If you have all values, you could just use PCA. FYI: there is currently work on an online PCA in scikit-learn, which would allow you to do PCA on all of the data, even though it does not fit into RAM. – Andreas Mueller Aug 03 '14 at 13:36
  • Maybe I just have a limited understanding of MDS. I thought the goal of MDS was to take distances in high dimensions and find out the best way to preserve and represent them in low dimensions. I don't see how this can be done with missing values. Thank you for letting me know about the new work in PCA, I will definitely check it out! – datadude Aug 03 '14 at 16:32
  • The missing values are assumed unobserved. – Andreas Mueller Aug 04 '14 at 17:03

2 Answers


MDS requires a full dissimilarity matrix AFAIK. However, I think it is probably not the best tool for what you plan to achieve. Assuming that your dissimilarity matrix is metric (which need not be the case), it surely can be embedded in 25,000 dimensions, but "crushing" that to 3D will "compress" the data points together too much. That results in the "crust" you'd like to peel away.

I would rather run a hierarchical clustering algorithm on the dissimilarity matrix, then sort the leaves (i.e. the proteins) so that the similar ones are kept together, and then visualize the dissimilarity matrix with rows and columns permuted according to the ordering generated by the clustering. Assuming short distances are colored yellow and long distances are blue (think of the color blind! :-) ), this should result in a matrix with big yellow rectangles along the diagonal where the similar proteins cluster together.
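A minimal sketch of this approach, using SciPy's hierarchical clustering on a toy dissimilarity matrix (the planted-cluster data and all variable names are illustrative):

```python
# Sketch: hierarchically cluster items from a dissimilarity matrix, then
# reorder rows and columns by the leaf ordering so similar items sit together.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Toy symmetric dissimilarity matrix with two planted clusters.
D = rng.random((30, 30))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)
D[:15, :15] *= 0.2  # make the first 15 items mutually similar
D[15:, 15:] *= 0.2  # ... and the last 15 as well

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="average")
order = leaves_list(Z)              # leaf ordering keeping similar items adjacent
D_sorted = D[np.ix_(order, order)]  # permute rows and columns identically
print(D_sorted.shape)
# Visualize with e.g. matplotlib's imshow(D_sorted): blocks of short
# distances should appear along the diagonal.
```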

You would have to downsample the image or buy a 25,000 x 25,000 screen :-) but I assume you want to have an "overall" low-resolution view anyway.

András Aszódi
  • Thanks for the response! I have three questions. 1. Are there other clustering approaches that I should consider? I like the simplicity of your approach, but as a data visualization tool, MDS looks a lot cooler. Of course, the most important thing is accurate clustering, but if you have a suggestion for another clustering algorithm like MDS, that would be awesome. 2. What is the algorithm with the lowest computational complexity that I can use for a program like this? I know that MDS takes forever (many hours, days) 3. Is there an algorithm that supports interpolating new data easily? – datadude Aug 02 '14 at 05:12
  • 1a) There are lots of clustering algorithms, because there is no single "best" way of clustering. Depends on the problem at hand, and there are also subjective elements (e.g. how you define the dissimilarity metric). 1b) MDS is _not_ a clustering algorithm, it is a dimensionality reduction approach. Because of the "crushing" I explained in my post, it may bring items "close together" which actually do not belong together. 2) Expect O(N^2) complexity because you have to process pairwise data. The matrix diagonalization in MDS is O(N^3), that's why it is slow. 3) I don't understand this. – András Aszódi Aug 04 '14 at 10:03
  • I ran out of characters in my previous comment ;-) One last remark: if you find my answer useful, why not giving it a +1? :-))) – András Aszódi Aug 04 '14 at 10:07
  • To clarify 3, I mean to say is there a way to plot new data into an existing visualization without rerunning the entire clustering algorithm. E.g. if I had an existing plot of 25000 proteins, is there an easy way to add 5 more points without running clustering with 25005 proteins. As for your remark, I definitely would but unfortunately I have to have 15 reputation to give a +1, sorry about that! – datadude Aug 04 '14 at 19:20

There are many algorithms under the name nonlinear dimensionality reduction. You can find a long list of them on Wikipedia; most of them were developed in recent years. If PCA doesn't work well for your data, I would try CCA or t-SNE. The latter is especially good at showing cluster structure.
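t-SNE also accepts a precomputed distance matrix, which fits the pairwise-dissimilarity setting in the question. A minimal sketch on toy data (sizes and parameters are illustrative; with precomputed distances, `init` must be set to `"random"` in recent scikit-learn versions):

```python
# Sketch: t-SNE on a precomputed distance matrix. Toy-sized data only;
# perplexity must be smaller than the number of samples.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((50, 8))
# Full symmetric pairwise Euclidean distance matrix.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=10, random_state=0)
emb = tsne.fit_transform(D)  # 2D embedding, one row per item
print(emb.shape)
```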

James LI