Iteratively/sequentially drop and replace df variable and feed into function

Question

I am trying to assess the stability of a correlation analysis by iteratively dropping a variable, and re-running the analysis.

As I understand it, this requires me to (1) create matrices of length p-1, by iteratively/sequentially dropping a variable from a dataframe, (2) run a correlation function over a series of matrices, and (3) feed the output into a common dataframe or list, for subsequent analysis.

I am able to achieve each of these steps manually, as follows:


#required library for cc function
library(CCA)

#set seed
set.seed(123)

#X and Y dataframes
X_df <- data.frame(replicate(4,sample(1:10,10,rep=TRUE)))
Y_df <- data.frame(replicate(3,sample(1:10,10,rep=TRUE)))

#X and Y as scaled matrices
X <- scale(X_df)
Y <- scale(Y_df)

#manually omit a variable/column from the X df
X1 <- scale(X_df[2:4]) 
X2 <- scale(X_df[c(1, 3:4)]) 
X3 <- scale(X_df[c(1:2, 4)]) 
X4 <- scale(X_df[1:3]) 

#manually omit a variable/column from the Y df
Y1 <- scale(Y_df[2:3]) 
Y2 <- scale(Y_df[c(1, 3)]) 
Y3 <- scale(Y_df[1:2]) 

#perform canonical correlation - X sets and Y
cX1 <- cc(X1,Y)$cor
cX2 <- cc(X2,Y)$cor
cX3 <- cc(X3,Y)$cor
cX4 <- cc(X4,Y)$cor

#perform canonical correlation - Y sets and X
cY1 <- cc(X,Y1)$cor
cY2 <- cc(X,Y2)$cor
cY3 <- cc(X,Y3)$cor

#get canonical correlation values into a df
XVALS <- as.data.frame(rbind(cX1, cX2, cX3, cX4))
YVALS <- as.data.frame(rbind(cY1, cY2, cY3))

Of course, I know it's very bad to do this manually, and my real data is much larger.

Unfortunately, I am pretty new to R (and coding), and have been struggling to achieve any of these steps in a better way. I am familiar with the (existence of) the apply functions and also some functions in dplyr that I think are likely relevant (e.g., select) but I just can't get it to work despite reading documentation and seemingly similar posts for hours -- any guidance would be greatly appreciated.

Rui Barradas · Accepted Answer · 2020-01-20T08:13:22.700

Don't `scale`.

First of all, there is no need for scaled vectors as the code below shows.
The reason why vectors are scaled is a variant of R FAQ 7.31, see also this SO post. With older processors the precision loss was a real problem, leading to clearly wrong results. This is no longer true, at least not in the general case.

#perform canonical correlation - original X sets and Y
cX1b <- cc(X_df[2:4], Y)$cor
cX2b <- cc(X_df[c(1, 3:4)], Y)$cor
cX3b <- cc(X_df[c(1:2, 4)], Y)$cor
cX4b <- cc(X_df[1:3], Y)$cor

XVALSb <- as.data.frame(rbind(cX1b, cX2b, cX3b, cX4b))

XVALS and XVALSb row names are different, make them equal in order to please all.equal().

row.names(XVALS) <- 1:4
row.names(XVALSb) <- 1:4

The results are not exactly equal but are within floating-point accuracy. In this case I'm testing equality with all.equal's default of .Machine$double.eps^0.5.

identical(XVALS, XVALSb)
#[1] FALSE

all.equal(XVALS, XVALSb)
#[1] TRUE

XVALS - XVALSb
#             V1            V2           V3
#1  0.000000e+00  1.110223e-16 0.000000e+00
#2 -1.110223e-16  1.110223e-16 5.551115e-17
#3  1.110223e-16 -2.220446e-16 2.220446e-16
#4  1.110223e-16  4.440892e-16 1.110223e-16

The question.

To get all combinations of columns leaving one out there is function combn. Function cc_df_one_out first calls combn on each of its arguments then apply to those indices an anonymous function computing CCA::cc.

Note that the rows order is not the same as in your posted example, since combn does not follow your order of column indices.

cc_df_one_out <- function(X, Y){
  f <- function(x) combn(ncol(x), ncol(x) - 1)

  X_inx <- f(X)
  Y_inx <- f(Y)

  ccX <- t(apply(X_inx, 2, function(i) cc(X[, i], Y)$cor))
  ccY <- t(apply(Y_inx, 2, function(i) cc(X, Y[, i])$cor))

  list(XVALS = as.data.frame(ccX), YVALS = as.data.frame(ccY))
}

cc_df_one_out(X_df, Y_df)
#$XVALS
#         V1        V2        V3
#1 0.8787169 0.6999526 0.5073979
#2 0.8922514 0.7244302 0.2979096
#3 0.8441566 0.7807032 0.3331449
#4 0.9059585 0.7371382 0.1344559
#
#$YVALS
#         V1        V2
#1 0.8975949 0.7309265
#2 0.8484323 0.7488632
#3 0.8721945 0.7452478

Wow, thank you so much! I was really struggling to achieve this, but now I understand how you did. Thanks also for the information about scale! — tryingToImprove, Jan 20 '20 at 09:22

Iteratively/sequentially drop and replace df variable and feed into function

1 Answers1

Don't scale.

The question.

Don't `scale`.