I'm running an analysis on some cancer data, using meta information about the patients to try to explain a variable, but the lm() function isn't giving the output I expected. According to posts like this one (Linear regression coefficient information as Data Frame or Matrix), the coefficients element of the "lm" object should be a matrix. Mine, however, is just a named vector.

The following is a reproducible example you can try. It needs the TCGAbiolinks package, which fetches the meta data for you and is installed through Bioconductor. If you don't have BiocManager, you'll have to install that first. I apologize for the inconvenience.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("TCGAbiolinks")

After installing TCGAbiolinks, you can download the data with:

library(TCGAbiolinks)

# Query, download and load the clinical supplement for the TCGA-BRCA project
query <- GDCquery(project = "TCGA-BRCA", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query)
clinData <- GDCprepare(query)
clinData <- as.data.frame(clinData$clinical_patient_brca)
# Drop the first two rows
clinData <- clinData[-c(1:2),]

Then run the following code. The response variable is numGenes, and I'm trying to estimate it from the columns in the data above. In this reproducible example you won't see any real association, because I just used random values for numGenes, but it still demonstrates my problem: the coefficients element of the lm object is not a matrix.

patientMeta <- clinData
patientMeta$Patient <- sapply(clinData[,2], function(x) return(strsplit(x, "-")[[1]][3]))

set.seed(1)  # any fixed seed works; it only makes the random numGenes values repeatable
numGenes <- data.frame(sample(1:1500, nrow(patientMeta)))
row.names(numGenes) <- patientMeta$Patient
names(numGenes) <- "numGenes"

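# Merge on the Patient column (by name rather than by position) and the row names of numGenes
patientMeta <- merge(patientMeta, numGenes, by.x = "Patient", by.y = 0, all = FALSE)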

# Change string columns to factors and numeric columns to numeric
patientMeta2 <- as.data.frame(patientMeta)
charCols <- which(apply(patientMeta2, 2, function(x) {
  # Treat a column as character if its first value cannot be parsed as a number
  is.na(suppressWarnings(as.numeric(x[1])))
}))
charCols <- names(patientMeta2)[charCols]
for (i in charCols) {
  patientMeta2[,i] <- as.factor(patientMeta2[,i])
}
numCols <- which(!(names(patientMeta2) %in% charCols))
numCols <- names(patientMeta2)[numCols]
for (i in numCols) {
  patientMeta2[,i] <- as.numeric(patientMeta2[,i])
}
# Remove columns that have no contrasts
# (a logical index is used because -which(...) drops every column when nothing matches)
patientMeta2 <- patientMeta2[, sapply(patientMeta2, function(x) length(unique(x))) != 1]
# Remove columns that have NA values
patientMeta2 <- patientMeta2[, !sapply(patientMeta2, anyNA)]
# Remove columns that have incomplete information
patientMeta2 <- patientMeta2[,which(apply(patientMeta2, 2, function(x) length(grep("\\[Not", x))) < 10)]
# Remove ID columns
patientMeta2 <- patientMeta2[,-c(1,2,3,4,21)]

# Create formula for regression
lmFitExpression <- paste0(names(patientMeta2)[-ncol(patientMeta2)], collapse = " + ")
lmFitExpression <- paste("numGenes ~", lmFitExpression)
lmFitExpression <- formula(lmFitExpression)
# Do linear regression
theLM <- lm(lmFitExpression, patientMeta2)
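
As an aside, here is a minimal sketch of two shorter ways to build the same kind of formula. It assumes the built-in mtcars data as a stand-in for patientMeta2, so the column names are illustrative and not from the TCGA data. reformulate() from the stats package assembles a formula from a character vector of term labels, and the `.` shorthand in a formula stands for every column other than the response.

```r
# Stand-in data: mtcars ships with R; "mpg" plays the role of numGenes.
dat <- mtcars[, c("mpg", "wt", "hp", "cyl")]

# Build the formula from the predictor names instead of pasting strings.
predictors <- setdiff(names(dat), "mpg")
f <- reformulate(predictors, response = "mpg")
print(f)

fitA <- lm(f, data = dat)

# Shorthand: "." expands to all columns other than the response.
fitB <- lm(mpg ~ ., data = dat)

# Both fits should estimate the same coefficients.
all.equal(coef(fitA), coef(fitB))
```

Either form avoids the manual paste0()/formula() steps, though the explicit version makes it easier to inspect which columns ended up in the model.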

Now if you look at the coefficients:

> head(theLM$coefficients)
                (Intercept)    prospective_collectionNO   prospective_collectionYES 
                 1780.36026                  -115.93932                  -109.12016 
 retrospective_collectionNO retrospective_collectionYES                  genderMALE 
                         NA                          NA                    75.11251 

You can see that it's not a matrix but a named vector. I have no idea why I'm getting the data in this form. I'm interested in the p-value column, but this seems to give only the Estimate column.
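
Note: a minimal sketch, on the built-in mtcars data rather than the TCGA data (the wt2 column is made up purely to force an NA coefficient), suggesting that the matrix described in the linked post is the coefficients element of summary(fit), not of the lm object itself.

```r
# Stand-in data: mtcars ships with R (an assumption; this is not the TCGA data).
dat <- mtcars
dat$wt2 <- dat$wt * 2          # perfectly collinear with wt, so it gets aliased

fit <- lm(mpg ~ wt + hp + wt2, data = dat)

# The lm object stores the estimates as a named numeric vector, NA included.
is.matrix(fit$coefficients)    # FALSE
fit$coefficients

# summary() returns a "summary.lm" object; its coefficients element is a matrix.
coefMat <- summary(fit)$coefficients   # coef(summary(fit)) is equivalent
is.matrix(coefMat)             # TRUE
colnames(coefMat)              # "Estimate" "Std. Error" "t value" "Pr(>|t|)"

# Aliased (NA) terms are left out of the matrix, so it can have fewer rows.
setdiff(names(coef(fit)), rownames(coefMat))   # "wt2"

# p-values as a named vector, or the whole table as a data frame.
pValues <- coefMat[, "Pr(>|t|)"]
coefDF  <- as.data.frame(coefMat)
```

If this carries over to theLM, then summary(theLM)$coefficients should hold the p-value column, and the retrospective_collection terms shown as NA above would simply be missing from that matrix.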

Zuhaib Ahmed