
I have a list named master_lst created from a CSV file using the following code:

import sys

infile = open(sys.argv[1], "r")
lines = infile.readlines()[1:]   # skip the header row
master_lst = ["read"]
for line in lines:
    line = line.strip().split(',')
    fourth_field = line[3]
    master_lst.append(fourth_field)
infile.close()

This master list has the unique set of sequences. Now I have to loop through 30 collapsed FASTA files to count the number of occurrences of each of these sequences in the master list. The format of the 30 files is as follows:

>AAAAAAAAAAAAAAA
7451
>AAAAAAAAAAAAAAAA
4133
>AAAAAAAAAAAAAAAAA
2783

For counting the number of occurrences, I looped through each of the 30 files and created a dictionary with sequences as keys and the number of occurrences as values. Then I iterated over each element of master_lst and matched it against the keys in the dictionary created in the previous step. If there is a match, I appended the value of the key to a new list (ind_lst); if not, I appended 0 to ind_lst. The code for that is as follows:

for file in files:
    ind_lst = []
    if file.endswith('.fa'):
        first = file.split(".")
        first_field = first[0]
        ind_lst.append(first_field)
        fasta = open(file)
        individual_dict = {}
        for line in fasta:
            line = line.strip()
            if line == '':
                continue
            if line.startswith('>'):
                header = line.lstrip('>')
                individual_dict[header] = ''
            else:
                individual_dict[header] += line
        for key in master_lst[1:]:
            a = 0
            if key in individual_dict.keys():
                a = individual_dict[key]
            else:
                a = 0
            ind_lst.append(a)
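
For the three sample records above, assuming that particular file is named file1.fa (a hypothetical name) and that master_lst[1:] holds those three sequences plus one that does not occur in the file, ind_lst would end up looking like this (note the counts are strings, since they are built up by concatenating the lines under each header):

ind_lst = ['file1', '7451', '4133', '2783', 0]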

Then I write master_lst to a CSV file and append each ind_lst to it as a new column, using the code explained here: How to append a new list to an existing CSV file?
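
As a rough idea of what that append step looks like, here is a minimal sketch assuming the output CSV already holds the Read column and each ind_lst is added as a new column; append_column and output.csv are illustrative names, not from the linked answer:

import csv

def append_column(csv_path, new_column):
    # read the existing rows, pair each with the new column's values, and rewrite the file
    with open(csv_path, 'rb') as f:   # Python 2; use text mode with newline='' on Python 3
        rows = list(csv.reader(f))
    rows = [row + [value] for row, value in zip(rows, new_column)]
    with open(csv_path, 'wb') as f:
        csv.writer(f).writerows(rows)

Here master_lst would be written first as the single "Read" column, and append_column('output.csv', ind_lst) would be called once per FASTA file.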

The final output should look like this:

Read                           file1     file2     ...     file30
AAAAAAAAAAAAAAA                 7451      4456
AAAAAAAAAAAAAAAA                4133      3624
AAAAAAAAAAAAAAAAA               2783      7012

This code works perfectly fine when I use a smaller master_lst, but when the size of the master_lst increases, the execution time increases far too much. The master_lst I am working with right now has 35,718,501 sequences (elements). When I subset just 50 sequences and run the code, the script takes 2 hours to execute, so for all 35,718,501 sequences it would take forever to complete.

Now I don't know how to speed up the script. I am not quite sure whether there are improvements that could be made to this script to make it execute in a shorter time. I am running my script on a Linux server which has 16 CPU cores. When I use the command top, I can see that the script uses only one CPU. But I am not an expert in Python and I don't know how to make it run on all available CPU cores using the multiprocessing module. I checked this webpage: Learning Python's Multiprocessing Module.

But I wasn't quite sure what should go under def and what under if __name__ == '__main__':. I am also not quite sure what arguments I should pass to the function. I was getting an error when I tried the first code from Douglas without passing any arguments, as follows:

  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run

self._target(*self._args, **self._kwargs)

I have been working on this for the last few days and I haven't been successful in generating my desired output. If anyone can suggest alternative code that runs faster, or show how to run this code on multiple CPUs, that would be awesome. Any help to resolve this issue would be much appreciated.

1 Answer


Here's a multiprocessing version. It uses a slightly different approach than your code, one which does away with the need to create the ind_lst.

The essence of the difference is that it first produces a transpose of the desired data and then transposes that into the desired result.

In other words, instead of creating this directly:

Read,file1,file2
AAAAAAAAAAAAAAA,7451,4456
AAAAAAAAAAAAAAAA,4133,3624
AAAAAAAAAAAAAAAAA,2783,7012

It first produces:

Read,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAA 
file1,7451,4133,2783
file2,4456,3624,7012

...and then transposes that with the built-in zip() function to obtain the desired format.

Besides not needing to create the ind_lst, this also allows producing one row of data per file rather than one column of it, which is simpler and can be done more efficiently.
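
To illustrate the transpose step with two of the sample sequences, zip(*data) turns the per-file rows back into the per-sequence rows of the desired output (Python 2 zip returns a list; on Python 3 you would wrap it in list()):

data = [['Read', 'AAAAAAAAAAAAAAA', 'AAAAAAAAAAAAAAAA'],
        ['file1', 7451, 4133],
        ['file2', 4456, 3624]]
print(zip(*data))
# [('Read', 'file1', 'file2'), ('AAAAAAAAAAAAAAA', 7451, 4456), ('AAAAAAAAAAAAAAAA', 4133, 3624)]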

Here's the code:

from __future__ import print_function

import csv
from functools import partial
from glob import glob
import operator
import os
from multiprocessing import cpu_count, Pool

def get_master_list(filename):
    with open(filename, "rb") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # ignore first row
        sequence_getter = operator.itemgetter(3)  # retrieves fourth column of each row
        return map(sequence_getter, reader)

def process_fa_file(master_list, filename):
    fa_dict = {}
    with open(filename) as fa_file:
        for line in fa_file:
            line = line.strip()
            if not line:                       # skip any blank lines
                continue
            if line[0] == '>':
                sequence = line[1:]            # header line holds the sequence itself
            else:
                fa_dict[sequence] = int(line)  # following line holds its count
    get = fa_dict.get  # local var to expedite access
    basename = os.path.basename(os.path.splitext(filename)[0])
    return [basename] + [get(key, 0) for key in master_list]

def process_fa_files(master_list, filenames):
    pool = Pool(processes=4)  # "processes" is the number of worker processes to
                              # use. If processes is None then the number returned
                              # by cpu_count() is used.
    # Only one argument can be passed to the target function using Pool.map(),
    # so create a partial to pass first argument, which doesn't vary.
    results = pool.map(partial(process_fa_file, master_list), filenames)
    header_row = ['Read'] + master_list
    return [header_row] + results

if __name__ == '__main__':
    master_list = get_master_list('master_list.csv')

    fa_files_dir = '.'  # current directory
    filenames = glob(os.path.join(fa_files_dir, '*.fa'))

    data = process_fa_files(master_list, filenames)

    rows = zip(*data)  # transpose
    with open('output.csv', 'wb') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(rows)

    # show data written to file
    for row in rows:
        print(','.join(map(str, row)))
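
As the comment in process_fa_files notes, passing processes=None (or simply calling Pool() with no argument) makes the pool use cpu_count() worker processes, so on the 16-core server mentioned in the question all cores would be used:

pool = Pool()  # one worker process per CPU core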
  • Thank you so much for showing how to solve the issue with the multiprocessing module, and for your time and effort. I used profiling on the previous script, but I wasn't quite sure how to find the exact bottleneck. When I finally found it, it was the if statement after the for loop that was problematic. I tried a different solution using a pandas data frame (without multiprocessing). I had just gotten closer to solving my issue when I also got your answer. Thanks :) I am a biologist who started learning Python recently, but I will definitely integrate your comments on script writing – rex Mar 13 '17 at 10:06
  • @rex: Is this multiprocessing approach any faster? – martineau Mar 17 '17 at 23:01