2

I am looking for a better (faster) way to identify specific entries in a huge text file and then extract the lines corresponding to each entry. The file is in this format:

>Entry1.1
#size=1688
704 1   1   1   4
979 2   2   2   0
1220    1   1   1   4
1309    1   1   1   4
1316    1   1   1   4
1372    1   1   1   4
1374    1   1   1   4
1576    1   1   1   4
>Entry2.1
#size=6251
6110    3   1.5 0   2
6129    2   2   2   2
6136    1   1   1   4
6142    3   3   3   2
6143    4   4   4   1
6150    1   1   1   4
6152    1   1   1   4
>Entry3.2
#size=1777
AND SO ON-----------

I have reduced the number of corresponding lines (above) for each entry, as they vary from a few hundred to a few thousand. The size of the file that holds all these entries ranges from 100 MB to 600 MB, and the number of entries I usually need to identify and extract corresponding lines for ranges from a few hundred to 15,000. At present I am using a REGEX to identify the entry name and then extract all corresponding lines until the next '>' symbol. To speed up the process I use the 'multiprocessing' package in Python 3.0. Here is the reduced code:

import re
from multiprocessing import Pool

def OldMethod(entry):##Finds entry and extracts corresponding lines till next '>'
    ###here 'denfile' is the whole density file read in as a single string
    patbase = '(>*%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'###pattern for extraction of gene entry
    found = re.findall(patbase % entry, denfile, re.DOTALL)
    if found:
        print('Entry found in density file\n')
        '''Do processing of corresponding lines'''
    return processed_result###result of the processing above

def NewMethod(entry):##As suggested in this thread
    name = entry[1]
    block = den_dict[name]##dictionary lookup instead of a regex search
    if block:
        '''Do processing of corresponding lines in Block'''

def PPResults(module,alist):##Parallel processing
    npool = Pool(int(nproc))    
    res = npool.map_async(module, alist)
    results = (res.get())###results returned in form of a list 
    return results

def main():
    ##Read density file once, split by '>' and build a dictionary of entry name -> corresponding lines
    fh_in = open(density_file, 'r') ###HUGE TEXT FILE
    denfile = fh_in.read().split('>')[1:] ###read once, use repeatedly
    global den_dict
    den_dict = {}
    for ent in denfile:
        ent_splt = ent.split('\n')
        den_dict[ent_splt[0]] = ent_splt[2:-1]
    ##Use new approach with multiprocessing
    results = PPResults(NewMethod, a_list)###'a_list' holds the entries for which processing needs to be done
    for i in results:##Write results from list to file
        fh_out.write(i)

I run this on a server with more than 500 GB of memory and 42 cores, but the script still takes a lot of time (hours to even a day), depending on the size of the huge file and the number of entries to be processed. Most of the time is spent locating a specific entry, as the processing of each entry is very basic.

What I am trying to achieve is to reduce the run time as much as possible. Please suggest the fastest possible strategy to perform this analysis.

RESULTS:

After following Janne Karila's suggestion (below) and using 'NewMethod' (above), the runtime for 300 entries is 120 seconds, which includes 85 seconds to read the huge density file and split it by '>' and 35 seconds to process the 300 entries using 32 cores.

Whereas using 'OldMethod' (above) with REGEX, the runtime for 300 entries is 577 seconds, which includes ~102 seconds to read the huge density file and 475 seconds to process the 300 entries using 32 cores.

The time to read the huge file fluctuates between 12 and 102 seconds, for reasons I am unsure of. In conclusion, the new method is at least 10-12 times faster. Seems like a decent improvement for now.

Thanks

AK

Bade
  • Unix command line solution may be quicker. Use AWK – Alvin K. Mar 22 '13 at 06:15
  • 1
    This is most definitely I/O bound, not CPU bound. Adding CPU parallelization will only make it slower unless you can chunk the input and assign each core a separate chunk to process. – tripleee Mar 22 '13 at 06:47
  • Also `multiprocessing` gives quite some overhead. In this situation I believe the overhead would be much bigger then the gain you get, since each entry is small. – Bakuriu Mar 22 '13 at 06:48

3 Answers

1

Your main problem is that you try to run the regex over the whole file, which takes an enormous amount of time.

  1. You should not read the file as one whole string; use `.readlines()` instead.
  2. After you have split the file into lines, you can easily find a new entry by checking only the very first symbol of each line; only if that symbol is '>' do you apply the regex to extract the entry number and so on.
  3. Then you just iterate over the line list until the first symbol becomes '>' again, then stop.

This should not take more than 20 seconds for a 600 MB file.
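
A rough sketch of the scanning loop described above (my own illustration, not lenik's code, using a plain string comparison on the header instead of a regex):

def find_entry(lines, entry):
    # Scan line by line and collect the lines that belong to one entry
    block = []
    capturing = False
    for line in lines:
        if line.startswith('>'):             # cheap first-character check
            if capturing:
                break                        # the next '>' header ends the wanted entry
            capturing = (line[1:].strip() == entry)
        elif capturing:
            block.append(line)
    return block

# usage sketch:
# lines = open(density_file).readlines()
# block = find_entry(lines, 'Entry1.1')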

lenik
  • 1
    The `readlines` method *does* read the file as a whole, at least on python2, and in any case simply iterating over the file-object is more pythonic. – Bakuriu Mar 22 '13 at 06:46
  • @Bakuriu do you understand the difference between regexing 600MB and 60 bytes? my point is the file should be naturally split into the separate lines, which should drastically improve the speed. i don't care if it's loaded in memory, just don't store it there as a 600MB chunk of bytes. – lenik Mar 22 '13 at 06:50
  • I already knew what you meant, but this doesn't change the fact that your point 1 is written incorrectly. Also, I don't see what it adds, since in point 2 you explain again that the file should be split into its lines. – Bakuriu Mar 22 '13 at 07:01
  • @Bakuriu iterating over the file might be significantly slower in some (rare?) cases: http://stackoverflow.com/a/6077078/1329296 and OP clearly stated he has no problem with file/memory size. – lenik Mar 22 '13 at 07:09
1

"Fastest possible" can be achieved by splitting the file ahead of time, into individual files in the file system, or by importing the data set to a database. Whether this makes sense in practice depends on the longevity of the data and your usage patterns. If you process a data set more than two or three times on average, it makes sense to pay up front for splitting the whole file the first time you need it, and then get subsequent queries on the same data more or less for free.

Taking implementation overhead into account, and given the suspicion that you could cut down on the execution time by several orders of magnitude by much simpler means, though, I would not attempt a significant overhaul unless you query the same data set thousands of times.

For simple splitting, see `man csplit`. The program is a bit tricky to use, so you will probably need more than just the manual page.

Hours for the splitting sounds seriously wrong, anyway. You should get it down to minutes just by reading a line at a time instead of attempting to read the whole file into core.
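
If you would rather stay in Python than use csplit, here is a rough line-at-a-time splitting sketch (mine, not part of the original answer; the 'entries' output directory name is made up for illustration):

import os

def split_entries(density_file, out_dir='entries'):
    # Write each '>EntryX.Y' block of density_file to its own file in out_dir
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    out = None
    with open(density_file) as fh:
        for line in fh:
            if line.startswith('>'):         # a new entry begins here
                if out:
                    out.close()
                name = line[1:].strip()
                out = open(os.path.join(out_dir, name + '.txt'), 'w')
            if out:
                out.write(line)
    if out:
        out.close()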

awk -v id="Entry1.1" '/^>/{p=($0 == ">" id)}p' file

This should translate to about a dozen lines of Python. Here's a proof of concept (in case you are not familiar with Awk and would like to understand what the above does, for example); it might not be the most elegant or idiomatic Python.

import sys, fileinput, os
entry = sys.argv[1]
p = False
for line in fileinput.input(sys.argv[2:]):
    line = line.rstrip(os.linesep)
    if line.startswith('>'):      # header line; startswith avoids an IndexError on blank lines
        p = (line[1:] == entry)   # print the block only for the requested entry
    if p:
        print(line)
tripleee
1

You can split the file into chunks at '>' and store them in a dictionary indexed by entry name.

d = dict(chunk.split(None,1) for chunk in denfile.split('>') if chunk)

Looking up an entry is then just `d["Entry1.1"]`.
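
For context, here is a minimal sketch of building and querying that dictionary (my illustration; 'density_file' is the file name from the question):

with open(density_file) as fh:
    denfile = fh.read()

# entry name -> rest of the block (the '#size=...' line plus the data lines)
d = dict(chunk.split(None, 1) for chunk in denfile.split('>') if chunk)

block = d["Entry1.1"]           # constant-time lookup instead of a regex scan
lines = block.splitlines()[1:]  # drop the '#size=...' line, keep the data lines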


EDIT: Since you are spending a lot of time reading the file, you should try to get the processing done in parallel while that reading is still going on. You don't need to store the entire file anywhere; just send each wanted entry off for processing as soon as you encounter it in the file.

def NewMethod(entry):
    '''Do processing of corresponding lines in Block'''

def retrieve_chunks(filename, alist):
    '''Generator that yields entries from file when entry name is in alist'''
    aset = set(alist)   #use a set for fast lookups
    chunk = None
    with open(filename) as f:
        for line in f:
            if line.startswith('>'):
                if chunk:
                    yield chunk
                name = line[1:].strip() 
                if name in aset:
                    chunk = [name] #enables capture of subsequent lines
                else:
                    chunk = None   #disables capture
            elif chunk:
                chunk.append(line)
    if chunk:
        yield chunk

def main():
    npool = Pool(int(nproc))   # Pool from multiprocessing; nproc as in the question
    results = []
    for entry in retrieve_chunks(density_file, a_list): 
        results.append(npool.apply_async(NewMethod, (entry,)))

    for proxy in results:
        fh_out.write(proxy.get())

Note, by the way, that if you passed the generator to Pool.map_async, it would read all of it before starting any work. That's why I'm using apply_async in a loop instead.

Janne Karila
  • I have edited my code and added a new 'NewMethod' function. As you suggested, the density file is read once and a dictionary is made with the entry name as key and the corresponding lines as value. This dictionary is looked up by key (entry name) to get the corresponding lines in the 'NewMethod' function, which is further parallelized by the 'PPResults' function. Could you please tell me how I can make an index of the dictionary keys and implement it? – Bade Mar 22 '13 at 17:15
  • @AK. By "indexed" I simply meant using entry name as the key of the dictionary, which is exactly what you seem to have done in the new code. Maybe you should try this version first without multiprocessing. – Janne Karila Mar 22 '13 at 17:27
  • @AK `denfile.split('>')` is very, very slow; you should have listened to my suggestion below. It might still easily cut your run time in half. – lenik Mar 23 '13 at 01:52
  • @lenik `str.split` processes hundreds of megabytes in a second on my old box, so it would not be the bottleneck here. – Janne Karila Mar 23 '13 at 08:50
  • @AK. I added a different approach to my answer. – Janne Karila Mar 23 '13 at 10:08