Filtering whilst using numpy.genfromtxt

Question

I have a file from which I only need to read certain values into an array. The file is divided by rows which specify a TIMESTEP value. I need the section of data following the highest TIMESTEP in the file.

The files will contain over 200,000 rows, and for any given file I won't know at which row the section I need begins or what the largest TIMESTEP value will be.

I am assuming that if I can find the row number of the largest TIMESTEP, then I can import starting at that line. All of the TIMESTEP lines begin with a space character. Any ideas on how I might proceed would be helpful.

Sample file

 headerline 1 to skip
 headerline 2 to skip
 headerline 3 to skip
 TIMESTEP =    0.00000000    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
 TIMESTEP =   0.119999997    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
3,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
 TIMESTEP =    3.00000000    
0,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
1,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0
2,    1.0,   1.0,    1.0,   1.0,      1.0,   1.0

Basic code

import numpy as np

with open('myfile.txt') as f_in:
  data = np.genfromtxt(f_in, skip_header=3, comments=" ")

I'd use regular Python file reading to find the correct TIMESTEP block. — hpaulj, Sep 24 '14 at 05:27
You might not even need `genfromtxt` to extract the data from the desired lines. Or load them into a `StringIO` buffer, and run `genfromtxt` on that. — hpaulj, Sep 24 '14 at 07:18
Thanks for the tip @hpaulj. I'll give that a shot. If you wanted to provide a basic example that would be awesome. :) — Carl, Sep 24 '14 at 08:35

François · Answer 1 · 2014-09-24T12:28:50.417

You can use filter() directly with genfromtxt(), because genfromtxt accepts any iterable of lines, including generators.

from numpy import genfromtxt

with open('myfile.txt', 'rb') as f_in:
    # keep only data rows; header and TIMESTEP lines start with a space
    lines = filter(lambda x: not x.startswith(b' '), f_in)
    data = genfromtxt(lines, delimiter=',')

In your case there is then no need for skip_header, since the header lines also begin with a space and are filtered out.
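As a rough illustration of what this filtering produces, here is a minimal sketch that runs the same idea on a shortened, made-up version of the sample held in an in-memory buffer. The `sample` string and the use of `io.StringIO` (text lines rather than the bytes read above) are assumptions made only so the snippet is self-contained. Note that the rows from every TIMESTEP block end up stacked in one array.

```python
import io
import numpy as np

# Shortened, hypothetical version of the sample file (3 columns instead of 7).
sample = (
    " headerline 1 to skip\n"
    " TIMESTEP =    0.00000000    \n"
    "0,    1.0,   1.0\n"
    "1,    1.0,   1.0\n"
    " TIMESTEP =    3.00000000    \n"
    "0,    1.0,   1.0\n"
    "2,    1.0,   1.0\n"
)

f_in = io.StringIO(sample)
# Generator that drops every line starting with a space
# (header lines and TIMESTEP lines alike).
lines = (line for line in f_in if not line.startswith(" "))
data = np.genfromtxt(lines, delimiter=",")

print(data.shape)  # rows from both blocks are stacked together
```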

answered Sep 24 '14 at 10:06

François

Thanks! That gets the bulk of the data imported, but what I really need is the section of data which comes _after_ the `TIMESTEP` line with the highest value. In this case, that is the section after `TIMESTEP = 3.00000000`. Ideally without needing to iterate over the whole file multiple times. – Carl Sep 24 '14 at 10:10
ok sorry I did not get your question. I'll post another answer. – François Sep 24 '14 at 12:31

score 1 · Accepted Answer · edited May 23 '17 at 11:44

What you can do is use a custom iterator.

Here is a working example:

from numpy import genfromtxt

class Iter(object):
    ' a custom iterator which returns a timestep and corresponding data '

    def __init__(self, fd):
        self.__fd = fd
        self.__timestep = None
        self.__next_timestep = None
        self.__finish = False
        for _ in self.to_next_timestep(): pass # skip header

    def to_next_timestep(self):
        ' iterate until next timestep '
        for line in self.__fd:
            if 'TIMESTEP' in line:
                self.__timestep = self.__next_timestep
                self.__next_timestep = float(line.split('=')[1])
                return
            yield line
        self.__timestep = self.__next_timestep
        self.__finish = True

    def __iter__(self): return self

    def __next__(self):
        if self.__finish:
            raise StopIteration
        data = genfromtxt(self.to_next_timestep(), delimiter=',')
        return self.__timestep, data

    next = __next__  # Python 2 spelling of the same method

with open('myfile.txt') as fd:
    blocks = Iter(fd)  # avoid shadowing the built-in iter()
    for timestep, data in blocks:
        print(timestep, data)  # data can be selected upon highest timestep
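To pick out the block with the highest timestep, any iterable of (timestep, data) pairs, such as the `Iter(fd)` object above, could be reduced with the built-in `max`. The sketch below uses a hand-made list of pairs as a hypothetical stand-in for the iterator so that it runs on its own; the array contents are invented for illustration.

```python
import numpy as np

# Hypothetical stand-in for the (timestep, data) pairs the iterator yields.
pairs = [
    (0.0, np.zeros((2, 3))),
    (3.0, np.full((2, 3), 2.0)),
    (0.119999997, np.ones((2, 3))),
]

# Compare on the timestep only; comparing whole tuples could fall back to
# comparing the arrays themselves when two timesteps are equal.
best_timestep, best_data = max(pairs, key=lambda pair: pair[0])

print(best_timestep)
print(best_data)
```

Because `max` consumes the iterable one pair at a time, only the current best block and the block being compared need to be held in memory, although every block is still parsed by genfromtxt.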

answered Sep 24 '14 at 12:38

François

I am getting an error where it won't read in the last section: UserWarning: genfromtxt: Empty input file: "" warnings.warn('genfromtxt: Empty input file: "%s"' % fname) – Carl Sep 25 '14 at 02:37
it's not an error, it's a warning :) the last section should read fine. You can easily catch the warning, see https://docs.python.org/2/library/warnings.html – François Sep 25 '14 at 09:07
Sorry, "warning" not "error". If I print the `timestep` values as I iterate though, it returns the second last `timestep` value twice, although `data` is correct. 0.0 0.119999997 0.119999997 – Carl Sep 25 '14 at 09:16
Oh yes right. Fixed :) And the warning also goes away. – François Sep 25 '14 at 09:57

hpaulj · Answer 3 · 2014-09-24T16:27:54.593

Here's a solution that uses a regular Python file read, applying genfromtxt to a list of lines. For illustration purposes I am parsing each block of data, but it could be easily modified to skip blocks that don't meet your timestep criteria.

I first wrote this with StringIO, as used in many of the genfromtxt doc examples, but all it needs is an iterable. So a list of lines works just fine.

import numpy as np
filename = 'stack26008436.txt'

def parse(tstep, block):
    print(tstep)
    print(np.genfromtxt(block, delimiter=','))

with open(filename) as f:
    block = []
    for line in f:
        if 'TIMESTEP' in line:
            if block:
                parse(tstep, block)
            block = []
            tstep = float(line.strip().split('=')[1])
        else:
            if 'header' not in line:
                block.append(line)
    parse(tstep, block)

producing from your sample:

0901:~/mypy$ python2.7 stack26008436.py
0.0
[[ 0.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 ...
 [ 3.  1.  1.  1.  1.  1.  1.]]
3.0
[[ 0.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.]
 [ 2.  1.  1.  1.  1.  1.  1.]]
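One way the loop could be modified to skip blocks that don't meet the timestep criterion is sketched below: it keeps only the raw lines of the block with the largest TIMESTEP seen so far and calls genfromtxt once at the end, so the file is read in a single pass. The function name `read_highest_timestep` and the shortened sample are hypothetical, and the sketch assumes, as in the question, that every block is introduced by a line containing `TIMESTEP =`.

```python
import io
import numpy as np

def read_highest_timestep(lines, delimiter=","):
    """Return (timestep, data) for the block with the largest TIMESTEP."""
    best_step = None   # largest TIMESTEP value seen so far
    best_lines = []    # raw data lines belonging to that block
    keeping = False    # True while reading the current best block
    for line in lines:
        if "TIMESTEP" in line:
            step = float(line.split("=")[1])
            keeping = best_step is None or step > best_step
            if keeping:
                # New maximum: discard the lines kept from the old one.
                best_step, best_lines = step, []
        elif keeping:
            best_lines.append(line)
    if best_step is None:
        raise ValueError("no TIMESTEP line found")
    return best_step, np.genfromtxt(best_lines, delimiter=delimiter)

# Shortened, made-up sample; the largest TIMESTEP is deliberately not last.
sample = io.StringIO(
    " headerline 1 to skip\n"
    " TIMESTEP =    0.00000000    \n"
    "0, 1.0, 1.0\n"
    "1, 1.0, 1.0\n"
    " TIMESTEP =    3.00000000    \n"
    "0, 2.0, 2.0\n"
    "1, 2.0, 2.0\n"
    " TIMESTEP =   0.119999997    \n"
    "0, 3.0, 3.0\n"
    "1, 3.0, 3.0\n"
)

step, data = read_highest_timestep(sample)
print(step)
print(data)
```

With a real file, the same function would presumably be called as `read_highest_timestep(f)` inside a `with open(filename) as f:` block. Header lines are skipped automatically because nothing is kept before the first TIMESTEP line.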
