I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10). I am interested in comparing the parental and B10 cell lines. The issue seems to be that my StringTie files are separate, meaning I have one for the parental cell line and one for B10. I've included the code I have written to date for context along with the error messages I received and troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I could get. This isn't something that anyone in my lab has done before so I'm struggling to do this without any guidance.
Thank you all in advance!
`# My code to go from StringTie to count data:
(I copy pasted this so all my notes are included. I'm new to R so they're really just for me. I'm not trying to explain to everyone what every bit of the code means condescendingly. You all likely know much more that I do)
# Open Data
# List StringTie output files for all samples
# All files should be in same directory
files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)
tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)
tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)
# Create a filter (vector) showing which rows have at least two columns with 5 or more counts
txi_B10.filter<-apply(txi_B10$counts,1,function(x) length(x[x>5])>=2)
txi_parental.filter<-apply(txi_parental$counts,1,function(x) length(x[x>5])>=2)
head(txi_parental.filter)
sum(txi_B10.filter)
# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the txi.filter value is >=5 is true
txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]
txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]
# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")
write.csv(txi_parental$counts, "txi_parental.counts.csv")
# Open count data
# Do this in order that the files are organized in file manager
txi_B10_counts <- read_csv("txi_B10.counts.csv")
txi_parental_counts <- read_csv("txi_parental.counts.csv")
# Set column names
colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
View(txi_B10_counts)
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
View(txi_parental_counts)
## R is case sensitive so you just wanna ensure that everything is in the same case
## convert Gene names which is column [[1]] into lowercase
txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
View(txi_parental_counts)
txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
View(txi_B10_counts)
## Capitalize the first letter of each gene name
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
View(txi_parental_counts)
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
View(txi_B10_counts)
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html
mergedCounts <- full_join (x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
view(mergedCounts)
# Replace NA with value = 0
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# Save file for merged counts
write.csv(mergedCounts, "MergedCounts.csv")
## --------------------------------------------------------------------------------
# My code to go from count data to DEseq2
# Import data
# I added my metadata incase the issue is how I set up the columns
# metaData is a file with your samples name and Comparison
# Your second column in metadata must be called Comparison, otherwise you'll get error in dds line
metadata <- read.csv(metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")
# Assign "Gene Names" as row names
# Notice how there's suddenly an extra row (x)?
# R automatically created and assigned column x as row names
# If you don't fix this the # of columns won't add up
rownames(countData) <- countData[,1]
countData <- countData[,-1]
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!!
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers
## --------------------------------------------------------------------------------
# How I tried to fix this:
# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead
dds <- DESeqDataSetFromTximport(countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison, : is(txi, "list") is not TRUE
# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I was interested in comparing n1 verses n2 for each cell line but that is not of interest to me
`