How to rearrange your data into an array for a PARAFAC model with the PTAk package in R

Question

I was originally running a PCA to reduce a large number of correlated measures (>10 behaviours) down to fewer variables (in PCA I used the first and second principal components). But this is not appropriate (similar situation to this OP) because we have repeated measures from the same individuals over time (Budaev 2010, pg. 6: "Using multiple measures from the same individuals as independent while computing the correlation matrix is pseudoreplication and is incorrect."). Because of this, it is recommended I use a PARAFAC model instead of PCA to do this (available through the PTAk package in R) - see Leibovici (2010) for details.

My data is stored as a data.frame object, where each row is one sample of one individual; an individual can be sampled multiple times in a year and across its lifetime.

Sample of my data (data available here):

individual  beh1   beh2     beh3   beh4    year
11979       0      0.0333   0      0       2014
12026       0.176  0.0882   0.441  0.0882  2014
12435       0.405  0.189    0      0.243   2014
12524       0      0        1      0       2014
12625       0      0        0      0       2014
12678       0      0        0      0       2014

To use the PTAk package, the data needs to be converted into an array. The code to do this is:

my_df <- array(as.vector(as.matrix(subset_data)), c(x, y, z))

where x is the number of rows, y is the number of columns, and z is the number of arrays.

My general question:

Which components of my data.frame should correspond to which measures in the array?

My initial guess would be that x should correspond to the number of individuals sampled (i.e., the number of rows in the original data.frame), but I am not sure what the y and z components should be.

Like this:

my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 4, 9))

where x is 5393 individuals, y is the number of variables (e.g., 4 behaviours), and z is the number of years (9 years).

This generates 9 arrays with each individual’s record as the rows, and each variable as a column (identifier, 4 behaviours, and the year of sampling). In theory each array would correspond to a certain year of sampling, but that is currently not the case.

My question in detail:

If this is the correct formatting for my array, how do I ensure that only one year of sampling data is included in each array (i.e., only samples from 2008 are in array 1, only 2009 in array 2, etc.)?

Alternatively, if my formatting is wrong, what is the correct array format for my data and question?

For example, should I group the data into arrays according to the behaviour (beh1, beh2, etc.), so the code looks like:

my_df<-array(as.vector(as.matrix(subset_data)), c(5393, 3, 4))

where there would be three columns per array corresponding to the identifier, value for the behaviour, and year of observation? If this is the proper formatting, how would I ensure that the arrays are divided based on the behaviours rather than the identifier and/or year columns?

score 1 · Accepted Answer · answered Jun 16 '21 at 15:48

First of all, the variables individual and year in your subset_data need to be discarded (or used as row names) because they are just identifiers; otherwise 'as.vector(subset_data)' would mix them in with the data. So use as.vector(as.matrix(subset_data[,-c(1,6)])), where 1 and 6 are the positions of individual and year in the sample shown.
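As a minimal sketch of that first step, the identifier columns can also be dropped by name rather than by position. The two rows below are copied from the sample in the question; the object names `beh` and `v` are just illustrative.

```r
# Two rows copied from the sample data in the question.
subset_data <- data.frame(
  individual = c(11979, 12026),
  beh1 = c(0, 0.176),
  beh2 = c(0.0333, 0.0882),
  beh3 = c(0, 0.441),
  beh4 = c(0, 0.0882),
  year = c(2014, 2014)
)

# Drop the identifier columns by name, then convert to a numeric matrix.
id_cols <- c("individual", "year")
beh <- as.matrix(subset_data[, !(names(subset_data) %in% id_cols)])

# Only behaviour values remain in the flattened vector.
v <- as.vector(beh)
print(dim(beh))
print(v)
```

Selecting by name avoids silently dropping a behaviour column if the column order ever changes.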

Then, look at the little example below: A=matrix(1:6,c(2,3))

as.vector(A) is [1] 1 2 3 4 5 6

So imagine 2 individuals (rows) and 3 behaviours (columns): that works!

When A is filled, or read back as a vector, the first index (dim(A)[1], here 2) runs faster than the second (dim(A)[2], here 3), and the same rule extends to arrays.
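To see that ordering concretely, here is a small sketch with the same toy matrix (the dimnames are only labels added for illustration):

```r
# Two individuals (rows) by three behaviours (columns); toy values.
A <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("ind1", "ind2"), c("beh1", "beh2", "beh3")))

# as.vector() walks down each column in turn: the row index changes fastest.
v <- as.vector(A)
print(v)

# Refilling a matrix of the same shape from that vector recovers A.
A_back <- matrix(v, nrow = 2, ncol = 3, dimnames = dimnames(A))
print(identical(A, A_back))
```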

So now imagine you have 4 years, where X[,,1] is your first year A: X<-array(0,c(2,3,4)); X[,,1]=A; X[,,2]=A*2; X[,,3]=A*10; X[,,4]=A/10

Note this could be a way of building your my_df, provided every year has the same individuals in the same row order:

my_df[,,1]<-as.matrix(subset_data[ subset_data[,6]==2014, -c(1,6) ]) etc.

My point was that as.vector(X) is then

1 2 3 4 5 6 2 4 6 8 10 12 ...

so the first year then the second year etc...
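A short sketch of that toy array, only to check the order in which the values come out:

```r
A <- matrix(1:6, nrow = 2, ncol = 3)

# 2 individuals x 3 behaviours x 4 years; each slice X[, , k] is one year.
X <- array(0, dim = c(2, 3, 4))
X[, , 1] <- A
X[, , 2] <- A * 2
X[, , 3] <- A * 10
X[, , 4] <- A / 10

# Flattening returns all of year 1 first, then all of year 2, and so on.
v <- as.vector(X)
print(v[1:12])
```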

So to come back to (or in fact start from) a matrix of individuals x variables, you'll need to permute the data: AA=matrix(aperm(X,c(1,3,2)),c(8,3)). Here 8 is 2 individuals times 4 years, and the 3 columns are the variables.

So if you start with that matrix AA, your array will be array(AA,dim=c(2,4,3)), i.e. individual x year x variable.
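The round trip can be sketched as follows, reusing the toy X from above. In aperm(X, c(1,3,2)) the vector says which old dimension goes where: new dimension 1 is old 1, new 2 is old 3, new 3 is old 2.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)
X <- array(c(A, A * 2, A * 10, A / 10), dim = c(2, 3, 4))  # ind x var x year

# Permute to ind x year x var, then flatten to a (ind within year) x var matrix.
Xp <- aperm(X, c(1, 3, 2))
AA <- matrix(Xp, nrow = 8, ncol = 3)
print(AA)

# Starting from AA, rebuild the ind x year x var array ...
X2 <- array(AA, dim = c(2, 4, 3))

# ... and, if preferred, permute back to ind x var x year.
X_back <- aperm(X2, c(1, 3, 2))
print(all(X_back == X))
```

Note that the rows of AA are ordered by year first and by individual within each year, which is the order a long data.frame would need before being reshaped this way.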

So with: AA=as.matrix(subset_data[,-c(1,6)])

you'll need to say array(AA,dim=c(nb_indi_repeated,9,4)) for 9 years and 4 variables, with the rows sorted by year and then by individual within each year. But 5393/9 is not a whole number, so it looks like you do not have a full, exact repetition for all individuals. So you'll need either to select the 'best sample' of the repeated individuals to define the years and the selected individuals, or to estimate the missing values, or to do something completely different! This could be defining a repetition not from years but from the series of repeated measures, the next one being either in the same year or later ...
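One way to sketch the 'select the repeated individuals' option in base R is shown below. Everything here is an assumption for illustration: the toy values are made up, repeated samples within a year are averaged (one possible choice, not something the answer prescribes), and only individuals seen in every year are kept.

```r
# Hypothetical long-format data with the same kind of columns as the question.
toy <- data.frame(
  individual = c(1, 2, 3, 1, 2, 1, 2, 3),
  beh1 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8),
  beh2 = c(1, 2, 3, 4, 5, 6, 7, 8),
  year = c(2014, 2014, 2014, 2015, 2015, 2016, 2016, 2016)
)
beh_cols <- c("beh1", "beh2")

# Assumption: average repeated samples so there is one row per individual-year.
agg <- aggregate(toy[beh_cols],
                 by = list(individual = toy$individual, year = toy$year),
                 FUN = mean)

# Keep only individuals observed in every year (individual 3 misses 2015).
years <- sort(unique(agg$year))
n_seen <- table(agg$individual)
keep <- names(n_seen)[as.vector(n_seen) == length(years)]
bal <- agg[as.character(agg$individual) %in% keep, ]

# Sort by year, then individual, so the individual index runs fastest.
bal <- bal[order(bal$year, bal$individual), ]

# individual x year x variable
X <- array(as.matrix(bal[beh_cols]),
           dim = c(length(keep), length(years), length(beh_cols)),
           dimnames = list(keep, as.character(years), beh_cols))
print(dim(X))
print(X["1", "2015", ])
```

With real data this selection may discard most individuals, which is exactly the trade-off described above.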

answered Jun 16 '21 at 15:48

DidierL


A few questions... (a) In the first example, the `A matrix` is split into arrays by year (i.e., 4). However, the `AA matrix` splits the data into arrays by variables (i.e., 3). Given that part of my question was which way should I be separating my data, which of these examples is more correct? (b) Can I not just use the `data.frame` that I already have (minus the identifier and year columns) and directly translate that into an `array`? Or is it necessary to rearrange my `data.frame` into a permuted `matrix` format? – Blundering Ecologist Jun 19 '21 at 04:18
(c) In the code `AA=matrix(aperm(X, c(1,3,2)),c(8,3))`, can you explain what the `c(1, 3, 2)` terms do? I understand that the `c(8,3)` creates a `matrix` of 8 rows with 3 columns, but I do not understand where the `c(1,3,2)` values are coming from, nor what aspect of the `matrix` they apply to. (d) I ideally want to keep all individuals in my analysis. However, sampling was inconsistent, with a majority being sampled only once, and the minority being sampled two or more times. Can I keep individuals who were only sampled once in the analysis, or am I forced to exclude them? – Blundering Ecologist Jun 19 '21 at 04:20
(e) Is it mandatory that I keep the same dimensions for all `arrays`, or can I mix and match? I.e., if 200 individuals were sampled once, 78 were sampled twice, and 25 were sampled 3 times, is there a way to account for the imbalanced sampling, or would I need to restrict all arrays to 25 total individuals? – Blundering Ecologist Jun 19 '21 at 04:20
(a) That was my point. To use `AA=subset_data[,-c(1,6)]` directly, the array has dimensions `n 9 4`, where `n` would be the number of individuals repeated over the 9 years. So this is fine: working with the array `n 9 4` or `n 4 9` is equivalent; it is just a matter of reading the data in the "right order". Your `my_df` is not reading the data in the "right order" for an array `n 4 9`, so that would mix things up ... and if you prefer to work with an array `n 4 9`, you do a permutation, an `aperm()` ... – DidierL Jun 20 '21 at 12:18
(c) Look into `help(aperm)`: for example, if `dim(X)` is `2 3 4`, then `dim(aperm(X,c(1,3,2)))` is `2 4 3`. – DidierL Jun 20 '21 at 12:22
(d) If you want to keep a "repeated" data table, you'll need to fill in the "missing" repetition for an individual in a specific year, so you need a strategy for this, e.g. the average of all other individuals with no missing values in that year, or of all those "very similar" (to be defined) to that individual. This is acceptable if you do not have much missing information ... Of course, you could do a PCA instead and, for the plot of individuals, use the years as labels to get an idea of a year effect! – DidierL Jun 20 '21 at 12:28
(e) Perhaps you can do a PCA of the 200, then a PTAk of the 78 x 4 var x 2 years and a PTAk of the 25 x 4 x 3, and compare, looking for any consistencies and/or year effect? – DidierL Jun 20 '21 at 12:31
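The year-average strategy from comment (d) could look roughly like the sketch below. The array and its values are hypothetical, and filling with the plain mean of the observed individuals is only the simplest of the options mentioned; it is reasonable only when few cells are missing.

```r
# Hypothetical 3 individuals x 2 years x 2 behaviours; individual 2 was not
# sampled in year 2, so its cells are NA.
X <- array(c(1, 2, 3,    4, NA, 6,      # beh1: year 1, year 2
             10, 20, 30, 40, NA, 60),   # beh2: year 1, year 2
           dim = c(3, 2, 2))

# Mean over individuals for every (year, behaviour) cell, ignoring NAs.
cell_means <- apply(X, c(2, 3), mean, na.rm = TRUE)

# Replace each missing value by the mean of its own year and behaviour.
X_filled <- X
for (k in seq_len(dim(X)[2])) {
  for (j in seq_len(dim(X)[3])) {
    miss <- is.na(X[, k, j])
    X_filled[miss, k, j] <- cell_means[k, j]
  }
}
print(X_filled[2, 2, ])
```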
