I have a list named master_lst
created from a CSV file using the following code
infile= open(sys.argv[1], "r")
lines = infile.readlines()[1:]
master_lst = ["read"]
for line in lines:
line= line.strip().split(',')
fourth_field = line [3]
master_lst.append(fourth_field)
This master list has the unique set of sequences. Now I have to loop 30 collapsed FASTA files to count the number of occurrences of each of these sequences in the master list. The file format of the 30 files is as follow:
>AAAAAAAAAAAAAAA
7451
>AAAAAAAAAAAAAAAA
4133
>AAAAAAAAAAAAAAAAA
2783
For counting the number of occurrences, I looped through each of the 30 file and created a dictionary with sequences as key and number of occurrences as values. Then I iterated each element of the master_lst
and matched it with the key in the dictionary created from the previous step. If there is a match, I appended the value of the key to a new list (ind_lst
). If not I appended 0 to the ind_lst
. The code for that is as follow:
for file in files:
ind_lst = []
if file.endswith('.fa'):
first = file.split(".")
first_field = first [0]
ind_lst.append(first_field)
fasta= open(file)
individual_dict= {}
for line in fasta:
line= line.strip()
if line == '':
continue
if line.startswith('>'):
header = line.lstrip('>')
individual_dict[header]= ''
else:
individual_dict[header] += line
for key in master_lst[1:]:
a = 0
if key in individual_dict.keys():
a = individual_dict[key]
else:
a = 0
ind_lst.append(a)
then I write the master_lst
to a CSV file and ind_lst
using the code explained here: How to append a new list to an existing CSV file?
The final output should look like this:
Read file1 file2 so on until file 30
AAAAAAAAAAAAAAA 7451 4456
AAAAAAAAAAAAAAAA 4133 3624
AAAAAAAAAAAAAAAAA 2783 7012
This codes work perfectly fine when I use a smaller master_lst
. But when the size of the master_lst
increases then the execution time increases too much. The master_lst
I am working with right now has 35,718,501 sequences(elements). When I subset 50 sequences and run the code, the script takes 2 hours to execute. So for 35,718,501 sequences it will take forever to complete.
Now I don't know how to speed up the script. I am not quite sure if there could be some improvements that can be made to this script to make it execute in a shorter time. I am running my script on a Linux server which has 16 CPU cores. When I use the command top, I could see that the script uses only one CPU. But I am not a expert in python and I don't know how to make it run on all available CPU cores using multiprocessing module. I checked this webpage: Learning Python's Multiprocessing Module.
But, I wasn't quite sure what should come under def
and if __name__ == '__main__':
. I am also not quite sure what arguments should I pass to the function. I was getting an error when I try the first code from Douglas, without passing any arguments as follow:
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
I have been working this for the last few days and I haven't been successful in generating my desired output. If anyone can suggest an alternative code that could run fast or if anyone could suggest how to run this code on multiple CPUs, that would be awesome. Any help to resolve this issue would be much appreciated.