I was originally running a PCA to reduce a large number of correlated measures (>10 behaviours) to fewer variables (I used the first and second principal components). However, this is not appropriate (a similar situation to this OP) because we have repeated measures from the same individuals over time (Budaev 2010, pg. 6: "Using multiple measures from the same individuals as independent while computing the correlation matrix is pseudoreplication and is incorrect."). Because of this, it has been recommended that I use a PARAFAC model instead of PCA (available through the PTAk package in R); see Leibovici (2010) for details.
My data is stored as a data.frame object, where each row is one observation of an individual; an individual can be sampled multiple times in a year and across its lifetime.
Sample of my data (data available here):
individual beh1 beh2 beh3 beh4 year
11979 0 0.0333 0 0 2014
12026 0.176 0.0882 0.441 0.0882 2014
12435 0.405 0.189 0 0.243 2014
12524 0 0 1 0 2014
12625 0 0 0 0 2014
12678 0 0 0 0 2014
To use the PTAk package, the data needs to be converted into an array. The code to do this is:
my_df <- array(as.vector(as.matrix(subset_data)), c(x, y, z))
where x is the number of rows, y is the number of columns, and z is the number of arrays.
My general question:
Which components of my data.frame should correspond to which dimensions of the array?
My initial guess would be that x should correspond to the number of individuals sampled (i.e., the number of rows in the original data.frame), but I am not sure what the y and z components should be.
Like this:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 4, 9))
where x is 5393 individuals, y is the number of variables (e.g., 4 behaviours), and z is the number of years (9 years).
This generates 9 arrays with each individual’s record as the rows, and each variable as a column (identifier, 4 behaviours, and the year of sampling). In theory each array would correspond to a certain year of sampling, but that is currently not the case.
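To illustrate why the slices do not line up with years, here is a minimal sketch on made-up toy data (the `toy` data frame and its values are hypothetical stand-ins for subset_data, not my real data). It uses the same pattern as the call above and shows that array() simply fills the flattened values in column-major order, without looking at the year column:

```r
# Toy stand-in for subset_data (values are invented for illustration only)
toy <- data.frame(
  individual = c(1, 2, 3, 1, 2, 3),
  beh1       = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  year       = c(2014, 2014, 2014, 2015, 2015, 2015)
)

# Same pattern as above: flatten column by column, then reshape
# into 3 rows x 3 columns x 2 slices
arr <- array(as.vector(as.matrix(toy)), c(3, 3, 2))

# Slice 1 holds the six identifiers plus the first three beh1 values
arr[, , 1]

# Slice 2 holds the last three beh1 values plus the six years
arr[, , 2]
```

In this toy case neither slice corresponds to a single year; the values are just cut into consecutive blocks.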
My question in detail:
If this is the correct formatting for my array, how do I ensure that only one year of sampling data is included in each array (i.e., only samples from 2008 are in array 1, only 2009 in array 2, etc.)?
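A rough sketch of the kind of structure I have in mind, again on hypothetical toy data: an individuals x behaviours x years array with one slice per year. It assumes that repeated samples of the same individual within a year can be collapsed to their mean, which is my assumption here and may not be appropriate; individuals not sampled in a given year would end up as NA, and how such cells should be handled is left open.

```r
# Toy stand-in for subset_data (values are invented for illustration only)
toy <- data.frame(
  individual = c(1, 2, 1, 2, 2),
  beh1       = c(0.1, 0.2, 0.3, 0.4, 0.6),
  beh2       = c(0, 1, 0, 1, 1),
  year       = c(2014, 2014, 2015, 2015, 2015)
)
behs <- c("beh1", "beh2")

# One individual x year matrix of means per behaviour
# (assumption: within-year repeats are averaged)
per_beh <- lapply(behs, function(b) {
  tapply(toy[[b]], list(toy$individual, toy$year), mean)
})
ids <- rownames(per_beh[[1]])
yrs <- colnames(per_beh[[1]])

# Stack the matrices into individuals x behaviours x years
arr <- array(NA_real_,
             dim = c(length(ids), length(behs), length(yrs)),
             dimnames = list(ids, behs, yrs))
for (j in seq_along(behs)) {
  arr[, j, ] <- per_beh[[j]]
}

# Each slice along the third dimension now holds a single year
arr[, , "2015"]
```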
Alternatively, if my formatting is wrong, what is the correct array format for my data and question?
For example, should I group the data into arrays according to the behaviour (beh1, beh2, etc.), so the code looks like:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 3, 4))
where there would be three columns per array corresponding to the identifier, value for the behaviour, and year of observation? If this is the proper formatting, how would I ensure that the arrays are divided based on the behaviours rather than the identifier and/or year columns?