
I am trying to visualize my high-dimensional data set in two axes (components) using non-metric multi-dimensional scaling (NMDS). This is available in the scikit-learn library. Here is my code:

from sklearn.manifold import MDS

embedding = MDS(n_components=2, metric=False, n_init=2, max_iter=100,
                verbose=0, eps=0.001, n_jobs=2, random_state=101,
                dissimilarity='euclidean')
# precip = precip[0:100]

precip_transformed = embedding.fit_transform(precip)
precip_transformed

The defaults are n_init=4, max_iter=300, and n_jobs=None (which means 1). This takes forever to run even though I reduced those values and increased n_jobs. It also makes my notebook crash after a while. I should mention that my data has 20,000 rows; when I keep the commented-out line of the code (only the first 100 rows), it works. Does anyone know how I can make this run faster, or some way to make sure the notebook won't crash?
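A quick back-of-envelope check of why the notebook might crash: MDS materializes the full n×n pairwise dissimilarity matrix internally (assuming float64 storage, and that SMACOF keeps more than one n×n array live at once):

```python
# Rough memory estimate for the pairwise dissimilarity matrix that
# MDS builds for n = 20,000 samples, stored as float64 (8 bytes each).
n = 20_000
bytes_per_float = 8
gib = n * n * bytes_per_float / 2**30
print(f"{gib:.1f} GiB per copy")  # ≈ 3.0 GiB per n×n array
```

With n_jobs=2 running parallel SMACOF initializations, several such arrays can be alive simultaneously, which is enough to exhaust memory on many machines.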

  • Check your memory and, if it's not enough, set n_jobs to 1. – sascha Jan 30 '19 at 18:06
  • Did you scale your features? That can affect performance on many dimensionality-reduction techniques – G. Anderson Jan 30 '19 at 18:29
  • Instead of just selecting the first 100 rows you can sample randomly. But the algorithm has O(n^3) complexity, so you still can't use a lot of instances. – hellpanderr Jan 30 '19 at 18:59
  • As all those 20,000 rows are the watersheds I am analyzing, I cannot reduce them; I just wanted to check whether it would work on 100 rows. So you think this computation will not be possible for all 20,000 rows? @hellpanderr – ilearn Jan 30 '19 at 19:11
  • @G.Anderson They all have the same scale or units, so no, I did not scale them. – ilearn Jan 30 '19 at 19:12
  • I haven't used it, but there is a package called `megaman` https://arxiv.org/pdf/1603.02763.pdf that is supposed to be able to handle bigger datasets. – hellpanderr Jan 30 '19 at 19:43
  • Also MDS is pretty slow, have you tried other methods from `manifold`? – hellpanderr Jan 30 '19 at 19:45
  • @hellpanderr The reason I have to use non-metric multi-dimensional scaling is this recommendation that I should follow: "NMS is generally regarded as the most effective ordination method for ecological community data, as it is well suited to non-normal and categorical data." – ilearn Jan 30 '19 at 20:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/187602/discussion-between-sina-shabani-and-hellpanderr). – ilearn Jan 30 '19 at 20:06
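Following hellpanderr's suggestion, here is a minimal sketch of fitting NMDS on a random subsample rather than the first 100 rows. The random matrix stands in for the real `precip` data, whose shape beyond 20,000 rows is not shown in the question (10 columns assumed):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(101)

# Stand-in for the real 20,000-row `precip` matrix (10 columns assumed;
# the question does not show the data's width).
precip = rng.random((20000, 10))

# Draw a random subsample instead of precip[0:100]; NMDS cost grows
# roughly cubically with n, so keep the sample small.
n_sample = 300
idx = rng.choice(precip.shape[0], size=n_sample, replace=False)
precip_sample = precip[idx]

# n_init=1 and n_jobs=1 keep both runtime and memory use modest.
embedding = MDS(n_components=2, metric=False, n_init=1, max_iter=100,
                eps=0.001, n_jobs=1, random_state=101,
                dissimilarity='euclidean')
precip_transformed = embedding.fit_transform(precip_sample)
print(precip_transformed.shape)  # (300, 2)
```

Repeating this over several random subsamples and checking that the ordinations agree would give some confidence that the sample is representative of all 20,000 watersheds.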

0 Answers