I'm running an analysis on some cancer data, using metadata about the patients to explain a variable. But the lm() function isn't giving the output I expected. According to posts like this one (Linear regression coefficient information as Data Frame or Matrix), the coefficients slot of the "lm" variable should be a matrix. However, mine is just a vector. The following is a reproducible example you can try. You'll need to install a package that fetches the metadata for you: TCGAbiolinks. You can install it with Bioconductor. If you don't have BiocManager, you'll have to install that first. I apologize for the inconvenience.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
After installing TCGAbiolinks, you can download the data with
library(TCGAbiolinks)

query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Clinical",
                  data.type = "Clinical Supplement",
                  data.format = "BCR Biotab")
GDCdownload(query)
clinData <- GDCprepare(query)
clinData <- as.data.frame(clinData$clinical_patient_brca)
# Drop the first two rows
clinData <- clinData[-c(1:2), ]
Then run the following code. The output variable is numGenes. I'm trying to use the columns in the data above to estimate it. In this reproducible example, you won't get any correlation because I just used random values for numGenes. But this example demonstrates my problem: it's not giving me a matrix for the coefficients slot of the lm variable.
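patientMeta <- clinData
# Use the third field of the value in column 2 as a short patient ID
patientMeta$Patient <- sapply(clinData[, 2], function(x) strsplit(x, "-")[[1]][3])
# Random placeholder values for the response variable
numGenes <- data.frame(sample(1:1500, nrow(patientMeta)))
row.names(numGenes) <- patientMeta$Patient
names(numGenes) <- "numGenes"
# Merge on the Patient column (x side) and the row names (y side)
patientMeta <- merge(patientMeta, numGenes, by.x = "Patient", by.y = 0, all = FALSE)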
# Change string columns to factors and numeric columns to numeric
patientMeta2 <- as.data.frame(patientMeta)
# A column is treated as character if its first value cannot be parsed as a number
charCols <- which(apply(patientMeta2, 2, function(x) {
  is.na(suppressWarnings(as.numeric(x[1])))
}))
charCols <- names(patientMeta2)[charCols]
for (i in charCols) {
  patientMeta2[, i] <- as.factor(patientMeta2[, i])
}
numCols <- which(!(names(patientMeta2) %in% charCols))
numCols <- names(patientMeta2)[numCols]
for (i in numCols) {
  patientMeta2[, i] <- as.numeric(patientMeta2[, i])
}
# Remove columns that have no contrasts
# (logical indexing, so nothing is dropped by accident when no column is constant)
hasContrast <- sapply(patientMeta2, function(x) length(unique(x)) > 1)
patientMeta2 <- patientMeta2[, hasContrast]
# Remove columns that have NA values
patientMeta2 <- patientMeta2[, !sapply(patientMeta2, anyNA)]
# Remove columns that have incomplete information
patientMeta2 <- patientMeta2[, which(apply(patientMeta2, 2, function(x) length(grep("\\[Not", x))) < 10)]
# Remove ID columns
patientMeta2 <- patientMeta2[, -c(1, 2, 3, 4, 21)]
# Create formula for regression
lmFitExpression <- paste0(names(patientMeta2)[-ncol(patientMeta2)], collapse = " + ")
lmFitExpression <- paste("numGenes ~", lmFitExpression)
lmFitExpression <- formula(lmFitExpression)
# Do linear regression
theLM <- lm(lmFitExpression, patientMeta2)
Now, if you look at the coefficients:
> head(theLM$coefficients)
                (Intercept)    prospective_collectionNO   prospective_collectionYES
                 1780.36026                  -115.93932                  -109.12016
 retrospective_collectionNO retrospective_collectionYES                  genderMALE
                         NA                          NA                    75.11251
You can see that it's not a matrix. I have no idea why I'm getting the data in this form. I'm interested in the p-value column, but this seems to give only the Estimate column.
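For a smaller check that doesn't need the download, here is a minimal sketch on the built-in mtcars data. The dataset and the names fit, coefs and coefTable are my own choices for illustration and are not part of the analysis above. It suggests the behavior isn't specific to my data: the coefficients element of the lm object is a named vector there too, while a matrix with a p-value column seems to come from summary() of the fit, which may be what the linked post is actually reading.

```r
# Minimal sketch on built-in data; mtcars and the variable names are illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The coefficients element of the lm object: a named numeric vector, no dimensions
coefs <- fit$coefficients
print(is.matrix(coefs))
print(names(coefs))

# The coefficient table of the summary: a matrix with one row per term
coefTable <- coef(summary(fit))
print(is.matrix(coefTable))
print(colnames(coefTable))

# p-values, if the summary table is what I should be looking at
pValues <- coefTable[, "Pr(>|t|)"]
print(pValues)
```

If that is right, then presumably summary(theLM)$coefficients is the matrix I'm after, although I'd expect the terms with NA estimates to be missing from it.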