How can I process my StringTie data so that I can run DEseq2 using R?

Question

I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10). I am interested in comparing the parental and B10 cell lines. The issue seems to be that my StringTie files are separate, meaning I have one for the parental cell line and one for B10. I've included the code I have written to date for context along with the error messages I received and troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I could get. This isn't something that anyone in my lab has done before so I'm struggling to do this without any guidance.

Thank you all in advance!

`# My code to go from StringTie to count data:

(I copy pasted this so all my notes are included. I'm new to R so they're really just for me. I'm not trying to explain to everyone what every bit of the code means condescendingly. You all likely know much more that I do)

# Open Data 
# List StringTie output files for all samples
# All files should be in same directory

files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)

tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)

tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)

# Create a filter (vector) showing which rows have at least two columns with 5 or more counts

txi_B10.filter<-apply(txi_B10$counts,1,function(x) length(x[x>5])>=2)
txi_parental.filter<-apply(txi_parental$counts,1,function(x) length(x[x>5])>=2)

head(txi_parental.filter)
sum(txi_B10.filter)

# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the txi.filter value is >=5 is true

txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]

txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]

# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")  
write.csv(txi_parental$counts, "txi_parental.counts.csv")

# Open count data
# Do this in order that the files are organized in file manager

txi_B10_counts <- read_csv("txi_B10.counts.csv")

txi_parental_counts <- read_csv("txi_parental.counts.csv")

# Set column names

colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
  View(txi_B10_counts)
  
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
  View(txi_parental_counts) 

## R is case sensitive so you just wanna ensure that everything is in the same case
## convert Gene names which is column [[1]] into lowercase 

txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
  View(txi_parental_counts)

txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
  View(txi_B10_counts)

## Capitalize the first letter of each gene name

capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
  }
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
  View(txi_parental_counts)

capFirst <- function(s) {
  paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
  View(txi_B10_counts)
 
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html  
  
mergedCounts <- full_join (x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
  view(mergedCounts)

# Replace NA with value = 0 
 
mergedCounts[is.na(mergedCounts)] = 0
 view(mergedCounts) 

# Save file for merged counts
 
write.csv(mergedCounts, "MergedCounts.csv")  

  
## -------------------------------------------------------------------------------- 

# My code to go from count data to DEseq2

# Import data
# I added my metadata incase the issue is how I set up the columns
# metaData is a file with your samples name and Comparison
# Your second column in metadata must be called Comparison, otherwise you'll get error in dds line


metadata <- read.csv(metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")

# Assign "Gene Names" as row names
# Notice how there's suddenly an extra row (x)?
# R automatically created and assigned column x as row names
# If you don't fix this the # of columns won't add up
   
rownames(countData)  <- countData[,1]
countData <- countData[,-1]

  
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!! 

dds <- DESeqDataSetFromMatrix(countData = countData,
                                colData = metaData,
                                design = ~ Comparison, tidy = TRUE)

# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers  

## -------------------------------------------------------------------------------- 

# How I tried to fix this:

# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code

mergedCounts[is.na(mergedCounts)] = 0
 view(mergedCounts) 

# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead

dds <- DESeqDataSetFromTximport(countData,
                                colData = metaData,
                                design = ~ Comparison, tidy = TRUE)

# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison,  : is(txi, "list") is not TRUE

# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I was interested in comparing n1 verses n2 for each cell line but that is not of interest to me

`

Welcome to StackOverflow. Your code currently isn't [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for others because we can't access your files like in `/Lab/RNAseq/StringTie/data/B10`. Can you make a minimal example using your data or example data where the same error message can be reproduced? — jrcalabrese, Dec 02 '22 at 17:22
Sorry! I didn't think about this. I uploaded the first 50 lines of my data here: https://github.com/Kmarti44/Example_StringTie — Kim Martins, Dec 05 '22 at 13:13

How can I process my StringTie data so that I can run DEseq2 using R?

0 Answers0