I am looking to optimize the performance of a big-data parsing problem I have, using Python. In case anyone is interested: the data shown below are segments of whole-genome DNA sequence alignments for six primate species.

Currently, the best way I know how to proceed with this type of problem is to open each of my ~250 files (20-50 MB each), loop through them line by line, and extract the data I want. The formatting (shown in the examples) is fairly regular, although there are important changes at each 10-100 thousand-line segment. Looping works fine, but it is slow.

I have been using numpy recently for processing massive (>10 GB) numerical data sets, and I am really impressed at how quickly I can perform different computations on arrays. I wonder whether there are high-powered solutions for processing formatted text that circumvent tedious for-looping?
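To give a concrete idea of the kind of bulk operation I am hoping for, here is a toy numpy sketch that compares two alignment columns in one vectorized step (seg_bytes, n_cols, hom and mac are made-up placeholders, not part of my real code):

import numpy as np

# Toy stand-in for one segment's data lines (6 alignment columns per line):
seg_bytes = b"GGGGGG\nCCCCTC\n"
n_cols, hom, mac = 6, 0, 4  # column count and 0-based species column indices

# View the block as a (lines x columns) array of single characters;
# the +1 accounts for the newline terminating each line.
arr = np.frombuffer(seg_bytes, dtype='S1').reshape(-1, n_cols + 1)
human, macaque = arr[:, hom], arr[:, mac]      # one species per column
mismatch_rows = np.where(human != macaque)[0]  # vectorized comparison
print(mismatch_rows)  # [1], i.e. only the second data line differs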
My files contain multiple segments with the following pattern:
<MULTI-LINE HEADER> # number of header lines mirrors number of data columns
<DATA BEGIN FLAG> # the word 'DATA'
<DATA COLUMNS> # variable number of columns
<DATA END FLAG> # the pattern '//'
<EMPTY LINE>
Example:
# key to the header fields:
# header_flag species chromosome segment_start segment_end quality_flag chromosome_data
SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//
SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//
<ETC>
I will only use segments where the species homo_sapiens and macaca_mulatta are both present in the header, and where field 6, which I called the quality flag in the comments above, equals '1' for each of those species. Since macaca_mulatta does not appear in the second example, I would ignore that segment completely.
I care about the segment_start and segment_end coordinates for homo_sapiens only, so in segments where homo_sapiens is present I will record these fields and use them as keys to a dict(). segment_start also gives me the first positional coordinate for homo_sapiens, which increases strictly by 1 for each line of data in the current segment.
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta. The header lines where homo_sapiens and macaca_mulatta appear (i.e. lines 1 and 5 in the first example) correspond to the columns of data representing their respective sequences. Importantly, these columns are not always the same, so I need to check the header to get the correct indices for each segment, and to check that both species are even in the current segment.
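To make the header logic concrete, the per-segment check I have in mind looks roughly like this (segment_indices and its return convention are invented for illustration):

def segment_indices(header_lines):
    """Return (hom_idx, mac_idx, hom_start, hom_end) for one segment, or None
    if either species is missing or its quality flag is not '1'. The i-th
    header line (0-based) corresponds to data column i."""
    found = {}
    for i, line in enumerate(header_lines):
        fields = line.split()
        if fields[1] in ('homo_sapiens', 'macaca_mulatta') and fields[5] == '1':
            found[fields[1]] = (i, int(fields[3]), int(fields[4]))
    if len(found) < 2:
        return None  # skip segments lacking either species
    hom_idx, hom_start, hom_end = found['homo_sapiens']
    return hom_idx, found['macaca_mulatta'][0], hom_start, hom_end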
Looking at the two lines of data in the first example, the relevant information for me is:
# homo_sapiens_coordinate homo_sapiens_base macaca_mulatta_base
11388669 G G
11388670 C T
For each segment containing info for homo_sapiens and macaca_mulatta, I will record the start and end for homo_sapiens from the header, plus each position where the two do NOT match, into a list. Finally, some positions have "gaps" or lower-quality data, e.g.:
aaa--A
I will only record positions where homo_sapiens and macaca_mulatta both have valid bases (which must be in the set ACGT), so the last variable I consider is a counter of valid bases per segment.
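So per segment the scan boils down to something like this sketch (score_segment and its arguments are invented names):

def score_segment(data_lines, hom_idx, mac_idx, start):
    """Count valid sites and list divergent human coordinates for one segment."""
    bases = set('ACGT')
    valid, mismatches = 0, []
    for offset, line in enumerate(data_lines):  # human coordinate advances by 1 per line
        human, macaque = line[hom_idx], line[mac_idx]
        if human in bases and macaque in bases:  # both species have a valid base
            valid += 1
            if human != macaque:  # divergent site
                mismatches.append(start + offset)
    return valid, mismatches

Applied to the first example, score_segment(['GGGGGG', 'CCCCTC'], 0, 4, 11388669) returns (2, [11388670]).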
My final data structure for a given file is a dictionary which looks like this:
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
(segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions), ...}
Here is the function I have written to carry this out using a for-loop:
import gzip as gz

bases = set('ACGT')  # valid bases (referenced below; the set of acceptable nucleotides)


def human_macaque_divergence(chromosome):
    """
    A function for finding the positions of human-macaque divergent sites within segments of species alignment tracts
    :param chromosome: chromosome number (integer)
    :return div_dict: a dictionary with tuple(segment_start, segment_end, valid_bases_in_segment) for keys and list(divergent_sites) for values
    """
    ch = str(chromosome)
    div_dict = {}
    # pd.align is my own path prefix to the alignment files, defined elsewhere
    with gz.open('{al}Compara.6_primates_EPO.chr{c}_1.emf.gz'.format(al=pd.align, c=ch), 'rb') as f:
        # key to the header fields:
        # header_flag species chromosome segment_start segment_end quality_flag chromosome_info
        # SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)
        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        passed = set()  # species on the desired chromosome whose quality_flag is '1'
        mismatch = []
        valid = 0
        pos = -1
        hom = None
        mac = None
        species_data = False  # a flag signalling that the lines we are viewing are alignment columns
        for line in f:
            if 'SEQ' in line:  # 'SEQ' signifies a segment info field
                assert species_data is False
                fields = line.split()
                # every header line corresponds to one data column, so always
                # record the species to keep the column indices aligned:
                species += [fields[1]]
                starts += [int(fields[3])]  # collect starts and ends
                ends += [int(fields[4])]
                if fields[2] == ch and fields[5] == '1':  # desired chromosome and quality_flag '1'
                    passed.add(fields[1])
            elif 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(passed):
                species_data = True
                # get the column indices to scan in the data lines:
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')
                pos = starts[hom]  # first homo_sapiens positional coordinate
            elif species_data and '//' not in line:
                assert pos > 0
                # record the relevant bases:
                human = line[hom]
                macaque = line[mac]
                if {human, macaque}.issubset(bases):
                    valid += 1
                    if human != macaque:
                        mismatch += [pos]
                pos += 1
            elif '//' in line:  # '//' signifies a segment boundary
                if species_data:
                    # store segment results once the boundary has been reached:
                    div_dict[(starts[hom], ends[hom], valid)] = mismatch
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                passed = set()
                mismatch = []
                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False
    return div_dict
This code works fine (perhaps it could use some tweaking), but my real question is whether there might be a faster way to pull this data without running the for-loop over every line. For example, loading the whole file using f.read() takes less than a second, although it creates a pretty complicated string. In principle, I assume that I could use regular expressions to parse at least some of the data, such as the header info, but I'm not sure whether this alone would increase performance without some bulk method to process each data column in each segment.
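For example, I picture something along these lines, splitting the decompressed text at segment boundaries and pulling all header fields per segment with one regex (an untested sketch; segments and the pattern are placeholders):

import re

header_re = re.compile(r'^SEQ (\S+) (\S+) (\d+) (\d+) (-?\d+)', re.MULTILINE)

def segments(raw_text):
    """Yield (header_fields, data_lines) for each segment of one file read."""
    for chunk in raw_text.split('//\n'):  # '//' marks segment boundaries
        if 'DATA\n' not in chunk:
            continue  # e.g. trailing text after the last '//'
        header, data = chunk.split('DATA\n', 1)
        yield header_re.findall(header), data.splitlines()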
Does anyone have suggestions for how I can circumvent looping through billions of lines and parse this kind of text file in a more bulk manner?
Please let me know in the comments if anything is unclear; I am happy to edit or respond directly to improve the post!