I am looking for a better (faster) way to identify specific entries in a huge text file and then extract the lines corresponding to each entry. The file is in this format:
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
I have reduced the number of corresponding lines (above) for each entry; in reality they vary from a few hundred to a few thousand. The size of the file that holds all these entries ranges from 100 MB to 600 MB, and the number of entries I usually need to identify and extract lines for ranges from a few hundred to 15,000. At present I am using a regex to identify the entry name and then extract all corresponding lines until the next '>' symbol. To speed up the process I use the 'multiprocessing' package in Python 3. Here is the reduced code:
import re
from multiprocessing import Pool

def OldMethod(entry):  ## Finds entry and extracts corresponding lines till next '>'
    patbase = '(>*%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'  ### pattern for extraction of gene entry
    found = re.findall(patbase % entry, denfile, re.DOTALL)  ### 'denfile' is the whole file read as one string (old setup)
    if found:
        print('Entry found in density file\n')
        ## Do processing of corresponding lines
        return processed_result

def NewMethod(entry):  ## As suggested in this thread
    name = entry[1]
    block = den_dict[name]
    if block:
        ## Do processing of corresponding lines in block
        return processed_result

def PPResults(module, alist):  ## Parallel processing
    npool = Pool(int(nproc))
    res = npool.map_async(module, alist)
    results = res.get()  ### results returned in the form of a list
    return results

def main():
    ## Read density file, split by '>' and make a dictionary of entry name -> corresponding lines
    fh_in = open(density_file, 'r')  ### HUGE TEXT FILE
    denfile = fh_in.read().split('>')[1:]  ### read once, use repeatedly
    global den_dict
    den_dict = {}
    for ent in denfile:
        ent_splt = ent.split('\n')
        den_dict[ent_splt[0]] = ent_splt[2:-1]

    ## Use new approach with multiprocessing
    results = PPResults(NewMethod, a_list)  ### 'a_list' holds the entries for which processing needs to be done
    for i in results:  ## Write results from the list to file
        fh_out.write(i)
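For the sample data above, this dictionary build produces entries such as den_dict['Entry1.1'] == ['704 1 1 1 4', ..., '1576 1 1 1 4']. The same dictionary could also be built in a single streaming pass over the file, which avoids holding both the full text and the split copy in memory at the same time; a minimal sketch (function and variable names are placeholders, not my actual code):

def build_density_dict(path):
    ## Stream the file line by line and collect data lines keyed by entry name.
    den_dict = {}
    current_name = None
    current_lines = []
    with open(path, 'r') as fh:
        for line in fh:
            line = line.rstrip('\n')
            if line.startswith('>'):        ## new entry header, e.g. '>Entry1.1'
                if current_name is not None:
                    den_dict[current_name] = current_lines
                current_name = line[1:]
                current_lines = []
            elif line.startswith('#'):      ## skip the '#size=...' line
                continue
            elif line:
                current_lines.append(line)
        if current_name is not None:        ## flush the last entry
            den_dict[current_name] = current_lines
    return den_dict

NewMethod can then look blocks up by name exactly as before with block = den_dict[name].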
I run this on a server with more than 500 GB of RAM and 42 cores, but the script still takes a lot of time (hours to even a day) depending upon the size of the huge file and the number of entries to be processed. In the whole process, most of the time is taken up in locating the specific entry, as the processing of each entry is very basic.
What I am trying to achieve is to reduce the run time as much as possible. Please suggest the fastest possible strategy to perform this analysis.
RESULTS:
After following Janne Karila's suggestion (below) and using 'NewMethod' (above), the runtime for 300 entries is 120 sec: 85 seconds to read the huge density file and split it by '>', plus 35 seconds to process the 300 entries using 32 cores.
Whereas using 'OldMethod' (above) with the regex, the runtime for 300 entries is 577 seconds: ~102 seconds to read the huge density file, plus 475 sec to process the 300 entries using 32 cores.
The time to read the huge file fluctuates between 12 sec and 102 seconds, for a reason I am unsure of. In conclusion, the new method is at least 10-12 times faster. Seems like a decent improvement for now.
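For reference, the read/split and processing phases can be timed separately with simple wall-clock measurements, along these lines (reusing the names from the code above; illustrative only):

import time

t0 = time.time()
fh_in = open(density_file, 'r')                 ### HUGE TEXT FILE
denfile = fh_in.read().split('>')[1:]
den_dict = {}
for ent in denfile:
    ent_splt = ent.split('\n')
    den_dict[ent_splt[0]] = ent_splt[2:-1]
t1 = time.time()                                ## end of the read/split phase
results = PPResults(NewMethod, a_list)
t2 = time.time()                                ## end of the processing phase
print('Read/split: %.1f sec, processing: %.1f sec' % (t1 - t0, t2 - t1))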
Thanks
AK