2

I would like to perform a three-way principal component analysis in R, and though I have found a few articles explaining how it works and how to interpret results, I cannot find any useful guides online on how to do it in R.

My data consists of 230 samples, 250,000 variables and 50 annotations. Usually people just do a standard PCA using only one annotation on the following type of data:

Standard data:

        var1 var2 var3 var4
Sample1  1/1  0/0  1/1  1/0
Sample2  1/0  1/1  1/1  1/0
Sample3  0/0  1/1  1/1  1/1
Sample4  0/0  0/0  1/1  0/0
Sample5  1/0  1/0  0/0  1/1

However, I would like to implement all annotation information into the analysis, such that I use all 50 matrices combindly for the analysis. In this way a combination of annotations may explain more of the variance between samples than a single annotation does alone, e.g. annotation 1 and 4 together explain more variance that annotation 1 alone.

Annotation 1:

         var1 var2 var3 var4
Sample1  1/1  0/0  1/1  1/0
Sample2  1/0  1/1  1/1  1/0
Sample3  0/0  1/1  1/1  1/1
Sample4  0/0  0/0  1/1  0/0
Sample5  1/0  1/0  0/0  1/1

Annotation 2:

         var1      var2      var3  var4
Sample1  missense  none      STOP  synonymous
Sample2  missense  missense  STOP  synonymous
Sample3  none      missense  STOP  synonymous
Sample4  none      none      STOP  none
Sample5  missense  missense  none  synonymous

Annotation 3:

         var1  var2   var3  var4
Sample1  0.30   0.00  0.01  0.04
Sample2  0.30  -0.24  0.01  0.04
Sample3  0.00  -0.24  0.01  0.04
Sample4  0.00  -0.24  0.01  0.00
Sample5  0.30  -0.24  0.00  0.04

Annotation 4:

         var1  var2  var3  var4
Sample1  CTCF  NONE  NONE  MAX
Sample2  CTCF  NONE  NONE  MAX
Sample3  NONE  NONE  NONE  MAX
Sample4  NONE  NONE  NONE  NONE
Sample5  CTCF  NONE  NONE  MAX

From what I have found there are three packages that can do the Tucker 3-way PCA: ThreeWay, PTAk and rTensor. I have tried to run ThreeWay, but the data structure that they use seems very ugly to work with. Maybe I could make this work, but the example in the ThreeWay article also generated an error, so I would prefer another package:

ThreeWay data structure:

         var1_anno1    var1_anno2    var1_anno3   var2_anno1    var2_anno2
Sample1  1/1           missense      0.30         0/0           missense
Sample2  1/0           missense      0.30         1/1           missense
Sample3  0/0           none          0.30         1/1           missense
Sample4  0/0           none          0.30         0/0           none
Sample5  1/0           missense      0.30         1/0           missense

The PTAk packages requires: "a tensor (as an array) of order k, if non-identity metrics are used X is a list with data as the array and met a list of metrics"

It is not clear to me what this means. I tried to look into the tensor packages of how to generate a tensor, but their example is very convoluted as they do tons of multiplications on various tensors, rather than explain the basics of how to create a tensor form ones data.

I would appreciate both comments on weaknesses of this approach and on how to create tensors, and how to analyse them using any of the packages.

Thanks

Esben Eickhardt
  • 3,183
  • 2
  • 35
  • 56

2 Answers2

0

I ended up using the PTAk package for running the analysis.

To build the tensors, I used the two packages tensor and abind.

I built the tensors (aka Multi-way Arrays) by creating a vector from a matrix, and then re-defining its dimensions in three dimensions. The function abind() was then used to merge the arrays from each individual into the final three-dimensional tensor.

for (i in 1:length(list_of_sample_matrices)) {  

    # Converting matrix into single sample tensor
    single_sample_tensor <- array(as.vector(list_of_sample_matrices[i])), c(250000, 50, 1))
    
    # Creating all sample tensor
    if (i == 1) {
        all_sample_tensor <- single_sample_tensor
    }

    # Adding a single sample tensor at the time to the all sample tensor
    if (i > 1) {
        all_sample_tensor <- abind(all_sample_tensor, single_sample_tensor)
    }
}
Blundering Ecologist
  • 1,199
  • 2
  • 14
  • 38
Esben Eickhardt
  • 3,183
  • 2
  • 35
  • 56
0

For all functions PTAk(), FCAk(), PCAn() and CANDPARA() (for CANDECOMP or PARAFAC) the data input is an array generated more likely from the function array().

Preparing the data read by array(), you need to remember there is (like for matrix(), for rows and columns) an order of indices the first running faster than the next etc...

So, using as.vector(), cbind(), rbind(), and all other data manipulation can prepare the data to be read by array(), and then possibly also using abind() (package abind) to combine some arrays.

Some examples are given in the JSS paper

e.g. abind(x1,x2,x3,x4,...xp,along=3) is indeed quite convenient to create a 3-way array n x q x p from the series of p matrices x1,x2, ... ,xp of dimensions n x q.

Blundering Ecologist
  • 1,199
  • 2
  • 14
  • 38
DidierL
  • 26
  • 5