I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a python script that searches for reads overlapping with each gene's coordinates and storing the values in a dictionary. I then use this dictionary to create a Pandas dataframe and save this as a csv. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt
#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}
#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
gene=row[0],row[1],row[2],row[3],row[4]
expGenes.append(row)
#print(expGenes)
#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
read=row[0],row[1],row[2]
MCNgenes.append(read)
#find read counts
for r in expGenes:
#include FPKM in the dictionary
d[r[0]]=[r[4]]
regionCount=0
#set start and stop points based on transcript file
chr=r[1]
start=int(r[2])
stop=int(r[3])
#print("start:",start,"stop:",stop)
for row in MCNgenes:
if start < int(row[1]) < stop:
regionCount+=1
d[r[0]].append(regionCount)
n+=1
df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?