0

I'd like to run a statistical test, row-by-matching-row, between two data frames gex and mxy. The catch is that I need to run it several times, each time using a different column from gex, yielding a different vector of test results for each run.

Here is what I have so far (using example values), after much help from @kristang.

gex <- data.frame("sample" =  c(987,7829,15056,15058,15072), 
                  "TCGA-F4-6703-01" = runif(5, -1, 1),
                  "TCGA-DM-A28E-01" = runif(5, -1, 1),
                  "TCGA-AY-6197-01" = runif(5, -1, 1),
                  "TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))

listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")

mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
                  "TCGA-AA-3663-11" = runif(5, -1, 1),
                  "TCGA-AD-6901-01" = runif(5, -1, 1),
                  "TCGA-AZ-2511-01" = runif(5, -1, 1),
                  "TCGA-A6-A567-01" = runif(5, -1, 1)) 

colnames(mxy) <- gsub("[.]", "_",colnames(mxy))

zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))

## BELOW IS FOR DIAGNOSTICS

write.table(mxy, file = "mxy.csv", 
            row.names=FALSE, col.names=TRUE, sep=",", quote=F)

write.table(gex, file = "gex.csv", 
            row.names=FALSE, col.names=TRUE, sep=",", quote=F)

## ABOVE IS FOR DIAGNOSTICS

for(i in seq(nrow(mxy)))
  for(colName in listx){

    zvalues <- zScore(gex[,colName[colName %in% names(gex)]],
                      mxy[i,])

    ## BELOW IS FOR DIAGNOSTICS

    write.table(gex[,colName[colName %in% names(gex)]], file=paste0(colName, "column", ".csv"),
                row.names=FALSE,col.names=FALSE,sep=",",quote=F)

    write.table(mxy[i,], file=paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE,col.names=FALSE,sep=",",quote=F)

    ## ABOVE IS FOR DIAGNOSTICS

    geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL, 
                              stringsAsFactors = FALSE)
    write.csv(geneexptest, file = paste0(colName, ".csv"), 
              row.names=FALSE, col.names=FALSE, sep=",", quote=F)
  }

The problem is that while it seems to go through and create the correct number of output files with the correct number of rows, etc...but it does not yield correct z-scores. I want it to calculate:

((Value from row z & given column of gex) - (Mean of values in row z across mxy)) / (Standard deviation of values in row z across mxy)

Then move on to the next row, and so on, filling in the first vector. THEN, I want it to calculate the same thing using the next column of gex, filling in a separate vector. I hope this makes sense.

I have a separate script which runs the same test using a pre-determined column vs the other data frame. The relevant for loop from that script looks like this:

for(i in seq_along(mxy)){
  zvalues[i] <- (gex_column_W[i] - mean(mxy[i,])) / sd(mxy[i,])
}
Henri Wathieu
  • 121
  • 3
  • 14
  • 1
    Are you sure about `for(j in length(listx))`? Shouldn't it be `for(j in 1:length(listx))`? – nicola Feb 02 '15 at 20:03
  • @nicola thanks for bringing that to my attention. I edited the above script. Please take a look if you get the chance! Thanks. – Henri Wathieu Feb 02 '15 at 22:11

1 Answers1

0

I think there may be a typo in your code, specifically you say you want "Mean of values in row z across mxy" but are using the mean(mxy[,i])) which selects the i'th column, not the i'th row. I re-wrote this section with for loops for clarity. (not sure why you were using lapply?)

# a function fo calculationg the z score
zScore <- function(x,y)(x - mean(y,na.rm=T))/sd(y,na.rm=T)

for(i in seq(nrow(mxy))) # note that length(mxy) is actually the number of columns in mxy
for(colName in listx){
    zvalues <- zScore(gex[,colName],# column == colName
                      mxy[i,])# row == i
    geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL, 
                          stringsAsFactors = FALSE)
    write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE, col.names=FALSE,  quote=F,
                sep = ",", dec = ".", append=(i > 1))

}

and alternative that does not rely on append:

for(colName in listx){
    geneexptest <- NULL
    for(i in seq(nrow(mxy))) {
        zvalues <- zScore(gex[,colName],# column == colName
                          mxy[i,])# row == i
        geneexptest <- rbind(geneexptest,
                            data.frame(gex$sample, zvalues, row.names = NULL, 
                              stringsAsFactors = FALSE))
    }
    write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE, col.names=FALSE,  quote=F,
                sep = ",", dec = ".", append=(i > 1))
}
Jthorpe
  • 9,756
  • 2
  • 49
  • 64
  • thanks for your feedback. Definitely an improvement. However, using `mean(y)` in the function `zScore` yields `NA`s in the resulting zscore vectors. Using `rowMeans(y)` gives me real values, but somehow only the last row calculates it correctly. Do you have any idea why this would happen? – Henri Wathieu Feb 03 '15 at 15:15
  • `mean(y)` will return a missing value when there are any missing values in y (hence the above edits to `zScore`). Also, check the classes of the fields in your data.frame (`lapply(mxy,class)`) to make sure they're all numeric or integer (otherwise you may get some strange type coercion when taking a row out of mxy). otherwise, check out [this answer](http://stackoverflow.com/questions/4442518/general-suggestions-for-debugging-r/1882765#1882765) for some general debugging tips. – Jthorpe Feb 03 '15 at 16:40
  • The problem is that instead of running zScore row-by-row (i.e. row 1 of gex[,colName] and row 1 of mxy, then moving on to row 2 of both, etc), it runs each row of zScore against ONLY the last row of mxy, each time. How might I run it on corresponding rows? By the way, I've updated the script in the OP. – Henri Wathieu Feb 03 '15 at 17:47
  • It's running it on every row, but you'r over-writing the file for every row. the trick is to use `write.csv(...., append= i > 1)` (see edits to the response). – Jthorpe Feb 03 '15 at 17:55
  • Hi @Jthorpe, I tried "append == i > 1" with no effect. I am wondering if you can run the updated script I put in the OP and check out what the problem might be. I made it write some sheets in between to show that only one row from mxy is ever used. Sorry, I am very new to R!! – Henri Wathieu Feb 03 '15 at 18:13
  • I didn't realize that `write.csv` ignroed the `append`. I switched to `write.table` in the reponse, and also provided an alternative method that doesn't rely on `append`. – Jthorpe Feb 03 '15 at 18:41