I'm trying to run a survival analysis for hundreds of genes within a specific cancer type. I have 2 data frames (m2 and m3). m2 includes the sample ID as well as a column for Overall survival (how long the sample has been alive for) and status (whether the sample is alive or deceased). In m3, column 1 is the sample ID and columns 2:256 are different genes. If a sample has a mutation in a gene, it is denoted by 1; if not, by 0. I am trying to determine which genes are statistically significantly associated with survival. I am trying to use a for loop to run the survdiff function on each gene and generate p-values, but I keep getting an error.

for (x in 2:ncol(m3)) {survdiff(Surv(m2$Overall.Survival, m2$Status) ~ x, data = m3)}

The error I keep getting is:

Error in model.frame.default(formula = Surv(m2$Overall.Survival, m2$Status) ~  : 
  variable lengths differ (found for 'x')
r2evans
Matt
  • The error seems clear, what do `nrow(m2)` and `nrow(m3)` return? – r2evans Feb 11 '18 at 03:42
  • I was hoping to return a survdiff p-value for each column (gene) for columns 2:256 (all genes that I have included). – Matt Feb 11 '18 at 03:45
  • Okay, but what does that have to do with the error or my question? – r2evans Feb 11 '18 at 03:47
  • I'm not quite sure I know the answer. I thought my code would give the output of the survdiff function for columns 2 - 256. Am I wrong about this? – Matt Feb 11 '18 at 03:51
  • You are not hearing me, Matt. All I asked was for you to return two numbers: the number of rows in each of your two data.frames. What is there to know? Looking at [your previous question](https://stackoverflow.com/questions/48726437/survdiff-p-value-comparison), though, I wonder: are you intentionally mixing data.frames here? Try `survdiff(Surv(m2$Overall.Survival, m2$Status) ~ x, data = m2)` (changing the last `m3` to `m2`). But, just like your previous question, since you provide no sample data, there is nothing for us to go on. – r2evans Feb 11 '18 at 04:07
  • You’re trying to iterate a function across columns using a for loop. You need to use `lapply` – Matt W. Feb 11 '18 at 04:42
  • Matt, how many columns does m2 have? If you want to process all rows, then you need `for (x in 2:nrow)`. I'm guessing you don't want col 1, as it's likely an id or date or something. If you want to run a function over all columns, try: `cols <- c(2:5)  # set column range`, then `df[,cols] %<>% lapply(function(x)` – Andrew Bannerman Feb 11 '18 at 04:47
  • `library(magrittr)`, then `cols <- c(2:length(df))  # set column range`, then `df[,cols] %<>% lapply(function(x) your_fun(x))` – Andrew Bannerman Feb 11 '18 at 05:09
  • Matt, your other question and this one suggest you could use a little advice on how to refine your question to improve our understanding and therefore the chances of getting a usable answer. [SO help](https://stackoverflow.com/help/mcve) includes some, but a [previous SO q/a](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) does a decent job. Pay attention to the use of `dput` (with *reduced* data), and the section on minimal code. – r2evans Feb 11 '18 at 16:24
  • Sorry r2evans, both nrow(m2) and nrow(m3) produce 72. – Matt Feb 11 '18 at 18:17

1 Answer

You have "x" as a numeric index and are using it on the RHS of a formula. The formula then sees x as a single number (length 1) rather than a column of 72 values, which is why you get "variable lengths differ". The RHS should be the name of a column, and furthermore, it should be in the form of a language object. One way of doing this would be to use as.formula, but the way I propose seemed a bit simpler to me. You apparently want to use "x" to select a column, so perhaps this code would deliver what you intended:

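library(survival)

for (x in names(m3)[2:ncol(m3)]) {
  # Survival columns come from m2, the gene column named by x comes from m3
  # (this assumes both data frames list the samples in the same row order)
  fit <- survdiff(Surv(Overall.Survival, Status) ~ .,
                  data = cbind(m2[c('Overall.Survival', 'Status')], m3[x]))
  print(paste(x, fit$chisq))
}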

That moves the "x" into the role of a character variable selecting a single column, and then the "dot" on the RHS selects everything in the data argument that is not on the LHS.
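
For comparison, the as.formula route mentioned above could look something like the sketch below. The toy data, the `ID` column name, and the gene names `GeneA`/`GeneB` are made up for illustration (the real m2 and m3 were never posted), and the `merge` step is my assumption about how to make sure the rows line up by sample.

```r
library(survival)

# Hypothetical toy data standing in for m2 and m3 (not the asker's data)
set.seed(1)
n  <- 72
m2 <- data.frame(ID = seq_len(n),
                 Overall.Survival = rexp(n, rate = 0.1),
                 Status = rbinom(n, 1, 0.7))
m3 <- data.frame(ID = seq_len(n),
                 GeneA = rbinom(n, 1, 0.4),
                 GeneB = rbinom(n, 1, 0.3))

# Merge on the sample ID so survival and mutation columns share the same rows
dat <- merge(m2, m3, by = "ID")

for (gene in names(m3)[-1]) {
  # Build the formula from text, e.g. Surv(Overall.Survival, Status) ~ GeneA
  f   <- as.formula(paste("Surv(Overall.Survival, Status) ~", gene))
  fit <- survdiff(f, data = dat)
  print(paste(gene, fit$chisq))
}
```

Gene names that are not syntactically valid R names (containing "-", for example) would need to be wrapped in backticks before being pasted into the formula text.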

I added the print because values computed inside an R for-loop are not auto-printed the way they are at the console. The loop itself returns NULL, and only by 1) assignment, 2) print, or 3) cat will you see results.
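
If you want to keep the results rather than just print them, the assignment option might look like this sketch. It again uses made-up toy data in place of m2 and m3, and it computes the log-rank p-value from the chi-square statistic with `pchisq`, using degrees of freedom equal to the number of groups minus one.

```r
library(survival)

# Hypothetical toy data standing in for m2 and m3 (not the asker's data)
set.seed(2)
n  <- 72
m2 <- data.frame(ID = seq_len(n),
                 Overall.Survival = rexp(n, rate = 0.1),
                 Status = rbinom(n, 1, 0.7))
m3 <- data.frame(ID = seq_len(n),
                 GeneA = rbinom(n, 1, 0.4),
                 GeneB = rbinom(n, 1, 0.3),
                 GeneC = rbinom(n, 1, 0.5))

genes <- names(m3)[-1]

# vapply returns a named numeric vector: one p-value per gene column
pvals <- vapply(genes, function(g) {
  d   <- cbind(m2[c("Overall.Survival", "Status")], m3[g])
  fit <- survdiff(Surv(Overall.Survival, Status) ~ ., data = d)
  # fit$n holds the group sizes, so its length is the number of groups
  pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
}, numeric(1))

print(pvals)
```

Like the loop above, this assumes the rows of the two data frames are in the same sample order.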

This is almost certainly doing it the wrong way on a statistical basis, and I think you should be consulting a statistician for help on understanding the serious pitfalls connected with "multiple comparisons".
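
To give a feel for what "multiple comparisons" means in practice, here is a small sketch using base R's `p.adjust`. The p-values are invented purely for illustration, and choosing Benjamini-Hochberg is my assumption, not a recommendation for your data; a statistician may well suggest a different approach.

```r
# Invented raw p-values, one per gene (illustration only, not real results)
pvals <- c(GeneA = 0.001, GeneB = 0.020, GeneC = 0.040, GeneD = 0.300)

# Benjamini-Hochberg adjustment controls the false discovery rate;
# method = "bonferroni" would control the family-wise error rate instead
padj <- p.adjust(pvals, method = "BH")

print(padj)
```

Adjusted p-values are never smaller than the raw ones, so with a couple of hundred genes some raw p-values below 0.05 may no longer count as significant after adjustment.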

IRTFM