I'd like to run a statistical test, row-by-matching-row, between two data frames gex and mxy. The catch is that I need to run it several times, each time using a different column from gex, yielding a different vector of test results for each run.
Here is what I have so far (using example values), after much help from @kristang.
gex <- data.frame("sample" = c(987,7829,15056,15058,15072),
                  "TCGA-F4-6703-01" = runif(5, -1, 1),
                  "TCGA-DM-A28E-01" = runif(5, -1, 1),
                  "TCGA-AY-6197-01" = runif(5, -1, 1),
                  "TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))
listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
                  "TCGA-AA-3663-11" = runif(5, -1, 1),
                  "TCGA-AD-6901-01" = runif(5, -1, 1),
                  "TCGA-AZ-2511-01" = runif(5, -1, 1),
                  "TCGA-A6-A567-01" = runif(5, -1, 1))
colnames(mxy) <- gsub("[.]", "_",colnames(mxy))
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))
## BELOW IS FOR DIAGNOSTICS
write.table(mxy, file = "mxy.csv",
            row.names=FALSE, col.names=TRUE, sep=",", quote=F)
write.table(gex, file = "gex.csv",
            row.names=FALSE, col.names=TRUE, sep=",", quote=F)
## ABOVE IS FOR DIAGNOSTICS
for(i in seq(nrow(mxy)))
  for(colName in listx){
    zvalues <- zScore(gex[,colName[colName %in% names(gex)]],
                      mxy[i,])
    ## BELOW IS FOR DIAGNOSTICS
    write.table(gex[,colName[colName %in% names(gex)]], file=paste0(colName, "column", ".csv"),
                row.names=FALSE,col.names=FALSE,sep=",",quote=F)
    write.table(mxy[i,], file=paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE,col.names=FALSE,sep=",",quote=F)
    ## ABOVE IS FOR DIAGNOSTICS
    geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL,
                              stringsAsFactors = FALSE)
    write.csv(geneexptest, file = paste0(colName, ".csv"),
              row.names=FALSE, col.names=FALSE, sep=",", quote=F)
  }
The problem is that although the script runs and creates the correct number of output files, each with the correct number of rows, it does not yield correct z-scores. I want it to calculate:
((Value from row z & given column of gex) - (Mean of values in row z across mxy)) / (Standard deviation of values in row z across mxy)
Then move on to the next row, and so on, filling in the first vector. THEN, I want it to calculate the same thing using the next column of gex, filling in a separate vector. I hope this makes sense.
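To make the intended calculation concrete, here is a minimal sketch on toy data. It assumes the two data frames have the same number of rows in the same order. The names gex_demo, mxy_demo, cols, and z_list, and all the values in them, are made up for illustration and are not part of my real script.

```r
# Hypothetical toy data: 3 matching rows with fixed values so results can be checked by hand
gex_demo <- data.frame(sample = c(1, 2, 3),
                       A = c(2, 5, 9),
                       B = c(0, 4, 6))
mxy_demo <- data.frame(P = c(1, 2, 3),
                       Q = c(2, 4, 6),
                       R = c(3, 6, 9))

# Per-row mean and standard deviation across all columns of mxy_demo
ref <- as.matrix(mxy_demo)
row_mean <- rowMeans(ref, na.rm = TRUE)
row_sd <- apply(ref, 1, sd, na.rm = TRUE)

# One z-score vector per selected gex column (cols plays the role of listx)
cols <- c("A", "B")
z_list <- lapply(cols, function(cn) (gex_demo[[cn]] - row_mean) / row_sd)
names(z_list) <- cols
```

With these values, the rows of mxy_demo have means 2, 4, 6 and standard deviations 1, 2, 3, so I would expect z_list$A to be 0, 0.5, 1 and z_list$B to be -2, 0, 0. That is the kind of result I expect from the real data: one vector per column in listx, with one entry per row.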
I have a separate script which runs the same test using a pre-determined column vs the other data frame. The relevant for loop from that script looks like this:
for(i in seq_along(mxy)){
  zvalues[i] <- (gex_column_W[i] - mean(mxy[i,])) / sd(mxy[i,])
}