
I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes), plus 12 files corresponding to DNA reads. I have a Python script that searches for reads overlapping each gene's coordinates and stores the counts in a dictionary. I then use this dictionary to create a pandas DataFrame and save it as a CSV. (I will be using these to create a scatterplot.)

The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):

MSTRG.38        NC_008066.1     9204    9987    48395.347656
MSTRG.36        NC_008066.1     7582    8265    47979.933594
MSTRG.33        NC_008066.1     5899    7437    43807.781250
MSTRG.49        NC_008066.1     14732   15872   26669.763672
MSTRG.38        NC_008066.1     8363    9203    19514.273438
MSTRG.34        NC_008066.1     7439    7510    16855.662109

And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):

JQ673480.1      697     778     SRX6359746.5505370/2    8       +
JQ673480.1      744     824     SRX6359746.5505370/1    8       -
JQ673480.1      1712    1791    SRX6359746.2565519/2    27      +
JQ673480.1      3445    3525    SRX6359746.7028440/2    23      -
JQ673480.1      4815    4873    SRX6359746.6742605/2    37      +
JQ673480.1      5055    5092    SRX6359746.5420114/2    40      -
JQ673480.1      5108    5187    SRX6359746.2349349/2    24      -
JQ673480.1      7139    7219    SRX6359746.3831446/2    22      +

The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.

I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:

#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt

#set sample name
sample='CON-2'
#set fraction number
f=6

#dictionary to store values
d={}

#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
    expGenes.append(row)
#print(expGenes)

#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
    read=row[0],row[1],row[2]
    MCNgenes.append(read)

#find read counts
for r in expGenes:
    #include FPKM in the dictionary
    d[r[0]]=[r[4]]
    regionCount=0
    #set start and stop points based on transcript file
    chr=r[1]
    start=int(r[2])
    stop=int(r[3])
    #print("start:",start,"stop:",stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount+=1
    d[r[0]].append(regionCount)
df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")

This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?

Sarah

2 Answers

If I understood correctly, you are trying to match coordinates of genes between different files. I believe the best option would be to use something like a KDTree partitioning algorithm.

You can use a KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as coordinates:

import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

dna = pd.DataFrame() # this is your dataframe with DNA data
rna = pd.DataFrame() # Same for RNA

# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]

dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)

# Now you can go through your data and match with DNA:
my_data = pd.DataFrame()

for start, stop in zip(my_data.start, my_data.stop):
    # KDTree.query expects a 2-D array of query points
    coord = np.array([[start, stop]])
    dist, idx = dna_kd.query(coord, k=1)

    # Assuming you need an exact match
    if np.isclose(dist[0, 0], 0):
        # Now that you have the index of the matching row in the DNA data
        # you can extract information using the index and do whatever
        # you want with it

        dna_gene_data = dna.iloc[idx[0, 0], :]
You can adjust your search parameters to get the desired results, but this will be much faster than a linear scan every time.
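To make the idea above concrete, here is a minimal, self-contained sketch of the KDTree matching on toy coordinates (it requires scikit-learn; the interval values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree

# Toy "DNA" intervals; in practice these would come from the BED file
dna = pd.DataFrame({"start": [697, 1712, 4815], "stop": [778, 1791, 4873]})

# Build the tree once over the (start, stop) pairs
tree = KDTree(dna[["start", "stop"]].to_numpy())

# Query a single interval; query() expects a 2-D array of points
coord = np.array([[1712, 1791]])
dist, idx = tree.query(coord, k=1)

# Distance 0 means an exact coordinate match
if np.isclose(dist[0, 0], 0):
    match = dna.iloc[idx[0, 0]]
    print(int(match["start"]))  # 1712
```

Note that a KDTree finds the *nearest* interval by Euclidean distance on (start, stop), so it suits exact or near-exact matching rather than general overlap queries.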

NotAName

Generally, Python is extremely easy to work with, at the cost of being inefficient! Scientific libraries (such as pandas and numpy) help here by paying the Python overhead only a limited number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with directly).

General advice

  • try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
  • try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
  • whenever you must iterate in native Python, iterate over the shorter axis of a dataframe (i.e. if there are only 10 columns but 10,000 rows, go over the columns)
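The advice above can be sketched for your case: keep the reads in a dataframe and replace the inner Python loop with one vectorized comparison per gene. The column names and values here are hypothetical stand-ins for your GTF/BED data:

```python
import pandas as pd

# Hypothetical gene table and read table (stand-ins for the real files)
genes = pd.DataFrame({
    "name":  ["MSTRG.38", "MSTRG.36"],
    "start": [9204, 7582],
    "stop":  [9987, 8265],
})
reads = pd.DataFrame({"start": [9300, 9500, 7600, 12000]})

starts = reads["start"].to_numpy()

# One vectorized comparison per gene instead of a per-read Python loop
genes["count"] = [
    ((starts > s) & (starts < e)).sum()
    for s, e in zip(genes["start"], genes["stop"])
]
print(genes["count"].tolist())  # [2, 1]
```

This still loops once per gene (~500 iterations), but each iteration does the 12-million-read comparison in compiled numpy code rather than in Python.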

More on this topic here:

How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!


Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a builtin one called cProfile which may work for you.

docs: https://docs.python.org/3/library/profile.html

python3 -m cProfile -o profile.out myscript.py
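Once you have a dump file like `profile.out`, the stdlib `pstats` module can summarize it. A small self-contained sketch (the profiled function is just a stand-in for your script's hot loop):

```python
import cProfile
import pstats

def work():
    """Stand-in for the real script's expensive inner loop."""
    return sum(i * i for i in range(10000))

# Profile the function and dump the stats to a file
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()
profiler.dump_stats("profile.out")

# Load the dump and print the five most expensive calls by cumulative time
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)
```

The `print_stats` output tells you which functions dominate the runtime, so you optimize the actual bottleneck rather than guessing.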
ti7
  • why iterate over columns with `for` and not `.apply()`? – jkr Dec 10 '20 at 00:15
  • There is no strict rule, but it can be easier to reason about and not too work-expensive (a list of python objects and references for each column vs one for each row). I've updated my example to be clearer about the case being when you have some reason to work in native Python there! – ti7 Dec 10 '20 at 00:19