
I have a very big file (1.5 billion lines) in the following form:

1    67108547    67109226    gene1$transcript1    0    +    1    0
1    67108547    67109226    gene1$transcript1    0    +    2    1
1    67108547    67109226    gene1$transcript1    0    +    3    3
1    67108547    67109226    gene1$transcript1    0    +    4    4 

                                 .
                                 .
                                 .

1    33547109    33557650    gene2$transcript1    0    +    239    2
1    33547109    33557650    gene2$transcript1    0    +    240    0

                                 .
                                 .
                                 .

1    69109226    69109999    gene1$transcript1    0    +    351    1
1    69109226    69109999    gene1$transcript1    0    +    352    0 

What I want to do is reorganize/sort this file based on the identifier in column 4. The file consists of blocks: concatenating columns 4, 1, 2 and 3 gives the unique identifier of each block. That identifier is the key of the dictionary all_exons, and the value is a numpy array containing all the values of column 8. Then I have a second dictionary, unique_identifiers, whose keys are the attributes from column 4 and whose values are the sets of corresponding block identifiers. As output I write a file in the following form:

>gene1
0
1
3
4
1
0
>gene2
2
0

I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.

import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np


def parse_blocks(bedtools_file):
    unique_identifiers = {} # Dictionary with key: gene, value: list of exons
    all_exons = {} # Dictionary containing all exons

    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]

           identifier="$".join([sp_line[3],sp_line[0],sp_line[1],sp_line[2]])
           if(identifier in all_exons):
               item = float(sp_line[7])
               all_exons[identifier]=np.append(all_exons[identifier],item)
           else:
               all_exons[identifier] = np.array([sp_line[7]],float)

           if(current_id in unique_identifiers):
               unique_identifiers[current_id].add(identifier)
           else:
               unique_identifiers[current_id] =set([identifier])
  return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)

w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">"+str(gene)+"\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base)+"\n")
w.close()

How can I improve the above code so that it runs faster?

fgypas

  • The best way to find out how to make your code run faster is by profiling: http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script – Igor Jul 29 '15 at 13:07
  • Just to clarify: Is the final goal to get the output file in the format you posted, or do you also need the data organized in dictionaries? Do you have entries with 'transcript2' in this file, or is it always 'transcript1'? – Cleb Jul 29 '15 at 13:07
  • The final goal is to get the output file. The reason I used dictionaries is that I pipe the input file while it is being generated, in order to save disk space. No, I do not have entries with transcript2 (sorry, I forgot to mention this). – fgypas Jul 29 '15 at 13:12
  • I did some profiling and it seems that the code spends its time in numpy.core.multiarray.concatenate, so the problem seems to be the repeated appending to the numpy array (see the sketch after these comments). – fgypas Jul 29 '15 at 13:16
  • @fitziano: I added a pandas-based version below; hope that helps. One more question regarding the data: you will now have duplicates, e.g. for gene1 you will have the values 0 and 1 twice. Is that intended, or should duplicates be removed? – Cleb Jul 29 '15 at 15:03
  • @Cleb Thanks a lot for the solution. It is indeed helpful and it seems to work on a small test file. Yes, duplicate entries should stay as they are now. What worries me about your solution is that the order of the genes is not guaranteed. In my code above I print the genes in a specific way: for gene in sorted(list(identifiers)) and for intron in sorted(list(identifiers[gene])). This guarantees that the genes will always be written in the same order. I am not sure whether that is the case with your solution. – fgypas Jul 29 '15 at 15:42
  • @fitziano: How is the speed comparison going? It would be interesting to see how the pandas solution compares to your approach... – Cleb Aug 17 '15 at 13:55
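
As the profiling comment above points out, the expensive part is the repeated np.append, which copies the whole array on every call. A minimal sketch of the usual fix, not the poster's final code: collect the values of column 8 in a plain Python list and convert each block to a numpy array only once, after the scan (the filename is a placeholder).

import numpy as np

all_exons = {}
with open("bedtools_file.bed") as fp:  # placeholder filename
    for line in fp:
        sp_line = line.strip().split("\t")
        identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
        # list.append is O(1) amortized; np.append copies the whole array each time
        all_exons.setdefault(identifier, []).append(sp_line[7])

# convert each block to a numpy array once, after the full scan
all_exons = {key: np.array(values, float) for key, values in all_exons.items()}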

2 Answers


Since you already import pandas, I provide a pandas solution which basically requires only two lines of code. However, I do not know how it performs on large data sets and whether it is faster than your approach (but I am pretty sure it is).

In the example below, the data you provide is stored in table.txt. I then use groupby to collect all the values of your 8th column into a list per identifier in your column 4 (note that my indices start at 0), and convert this data structure into a dictionary which can then be printed easily.

import pandas as pd

df = pd.read_csv("table.txt", header=None, sep=r"\s+")  # replace the separator by e.g. '\t'

op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))

So in this case op looks like this:

{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}

Now you could print the output like this and pipe it into a file:

for k, v in op.iteritems():
    print k.split('$')[0]
    for val in v:
        print val

This gives you the desired output:

gene1
0
1
3
4
1
0
gene2
2
0

Maybe you can give it a try and let me know how it compares to your solution!?
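
A note beyond the original answer: read_csv as used above loads the whole file into memory, which may not be feasible for 1.5 billion lines. If that becomes a problem, one option could be to read the file in chunks and accumulate the per-identifier lists yourself; a rough, untested sketch (the chunk size is just a placeholder):

import pandas as pd

op = {}
# read the file in pieces instead of all at once
for chunk in pd.read_csv("table.txt", header=None, sep=r"\s+", chunksize=10**6):
    for key, values in chunk.groupby(3)[7]:
        # rows of the same block can span chunk boundaries, so extend the list per key
        op.setdefault(key, []).extend(values.tolist())

The resulting op has the same structure as above, so the printing code does not change.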

Edit 2:

In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:

# add some fake genes to op from above
op['gene0$stuff'] = [7,9]       
op['gene4$stuff'] = [5,9]

# print using 'sorted'
for k, v in sorted(op.iteritems()):
    print k.split('$')[0]
    for val in v:
        print val

which gives you:

gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
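
Not part of the original answer: if you also want the '>' prefix from the desired output in the question and prefer writing to a file instead of stdout, a small sketch along these lines should do (the output filename is a placeholder):

with open("output.txt", "w") as out:
    for k, v in sorted(op.items()):
        out.write(">" + k.split("$")[0] + "\n")
        for val in v:
            out.write(str(val) + "\n")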

Edit 1:

I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:

op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))

Now op2 would look like this:

{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}

You print the output as before:

for k, v in op2.iteritems():
    print k.split('$')[0]
    for val in v:
        print val

which gives you

gene1
0
1
3
4
gene2
0
2
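
One caveat, not from the original answer: a set does not preserve the order in which the values appear in the file. If you need de-duplication that keeps the first-occurrence order, pd.unique could be used instead; a sketch:

op3 = dict(df.groupby(3)[7].apply(lambda x: list(pd.unique(x))))

For the example data this would give {'gene1$transcript1': [0, 1, 3, 4], 'gene2$transcript1': [2, 0]}.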
Cleb

Let me try to simplify your problem; my solution works like this:

  • First, scan over the big file. For every different current_id, open a temporary file and append the value of column 8 to that file.
  • After the full scan, concatenate all the chunks into one result file.

Here's the code:

# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess


class ChunkBoss(object):
    """Boss for file chunks"""

    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            self.opened_files[current_id] = open(tempfile.mktemp(), 'wb')
            self.opened_files[current_id].write('>%s\n' % current_id)

        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Catenate chunks to one big file
        """
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)

        # Flush chunks
        [chunk.flush() for chunk in self.opened_files.values()]

        # By calling cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        [os.unlink(chunk.name) for chunk in self.opened_files.values()]


def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]

            # Write value to temp chunk
            boss.write_chunk(current_id, value)

    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()

I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.
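
Two caveats, not from the original answer: tempfile.mktemp() is considered unsafe because the returned name can be claimed by another process before the file is opened, and keeping one handle open per current_id can hit the operating system's limit on open file descriptors if there are very many genes. The first point could be addressed by switching write_chunk to NamedTemporaryFile, roughly like this (a sketch only, not the answerer's code):

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            # delete=False keeps the file on disk so it can still be concatenated and cleaned up later
            tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)
            tmp.write('>%s\n' % current_id)
            self.opened_files[current_id] = tmp
        self.opened_files[current_id].write('%s\n' % value)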

piglei