Looking at the parameters to the Rtsne function:

https://cran.r-project.org/web/packages/Rtsne/Rtsne.pdf

There is a parameter called "pca" defined as "logical; Whether an initial PCA step should be performed (default: TRUE)"

Let's say you have a 10-dimensional feature set and you run t-SNE. I was thinking you would scale the 10-D matrix and then pass it to Rtsne().

What does the pca indicated by the pca parameter do?

Would it take the 10-D matrix and run PCA on that? If so, would it pass all 10 dimensions in the PCA space to Rtsne?

Is there any info anywhere else about what this initial PCA step is?

Thank you.

user3022875

1 Answer

The original t-SNE paper used PCA to reduce the dimensionality of the MNIST data prior to running t-SNE.
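As a rough illustration of what such an initial PCA step amounts to, here is a minimal NumPy sketch. It is not Rtsne's actual code. It assumes the step centers the data and keeps at most `initial_dims` principal components; `initial_dims` mirrors the Rtsne argument of that name (default 50 according to the package manual), while `pca_reduce` and the random test matrix are hypothetical and exist only for this example.

```python
import numpy as np


def pca_reduce(X, initial_dims=50):
    """Project X onto at most `initial_dims` principal components.

    Hypothetical helper: a sketch of a PCA preprocessing step,
    not the implementation used by any particular t-SNE package.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(initial_dims, Vt.shape[0])      # cannot keep more axes than exist
    return Xc @ Vt[:k].T                    # scores on the first k components


# Assumed toy data: 200 samples with 10 features, scaled as in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

scores_all = pca_reduce(X, initial_dims=50)  # 10 features < 50: all 10 kept
scores_3 = pca_reduce(X, initial_dims=3)     # truncated to 3 components

print(scores_all.shape, scores_3.shape)
```

Under these assumptions, a 10-dimensional input with `initial_dims` at 10 or above is only rotated, so the Euclidean distances between points are unchanged. The step discards information only when `initial_dims` is smaller than the number of features.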

Has QUIT--Anony-Mousse
  • Ah, good to know! I looked it up, they write: "In all of our experiments, we start by using PCA to reduce the dimensionality of the data to 30. This speeds up the computation of pairwise distances between the datapoints and suppresses some noise without severely distorting the interpoint distances." It might be useful to be able to tweak the magical number 30 for complex data sets? For large data sets, the PCA step apparently takes much more time than the tSNE step. I guess the PCA argument is there so that you can skip it if your data already happens to be PCA-transformed. – plijnzaad Oct 16 '18 at 16:19
  • PCA should be much cheaper than tSNE, unless you have so many dimensions that the curse of dimensionality ruins distances anyway. – Has QUIT--Anony-Mousse Oct 17 '18 at 06:46
  • t-SNE is used successfully very often in the field of single-cell transcriptomics, where you typically have ~20,000 dimensions to reduce, but with many of them being 0 or collinear, so the curse of dimensionality isn't too bad. There, SVD is the more time-consuming step, see e.g. https://github.com/jkrijthe/Rtsne/issues/26 – plijnzaad Oct 17 '18 at 09:23
  • In such cases it may be worth skipping PCA obviously because the PCA results are not reliable if you don't have n>>p... plus, you may want to use a more appropriate similarity measure... – Has QUIT--Anony-Mousse Oct 18 '18 at 07:47