Python -- data in output file in inconvenient location

Question

I have an output file from a program that I'm running (not one of my own creation), and some of the data that I need to access is on commented (leading #) lines within the output file. The segment of the output file that I want will always start and end with the same lines, but their location relative to the beginning of the file and to each other will not always be the same.

Let's say that my output file is called output.txt. What I've tried to do for accessing the wanted lines within output.txt is the following:

data_file = open("output.txt", "r")
block = ""
found = False

for line in data_file:
    if found:
        block += line
        if line.strip() == "# This isn't the actual line either, but I want to stop here:": break
    else:
        if line.strip() == "# This isn't the actual line, but I'm making a working example:":
            found = True
            block = "# This isn't the actual line, but I'm making a working example:"

And that does indeed get me the lines that I want. However, what this leaves me with is something that I'm not sure how to use. All I want out of this are the columns of numerical values. I've thought about using the split() command, but I don't want to break block into strings... I want to keep the nice tab-delimited columns and put them into a NumPy array.

# This isn't the actual line, but I'm making a working example:
# 
#    point     c[0]        c[1]        c[2]     
# -0.473359  7161.325229    -609.475403  49128.219132   
# -0.459864  7162.047233    -102.060363  1189.270542    
# -0.404065  7160.055198     467.778393 -23832.885052   
# -0.385952  7160.708981     0.675271    2.177786   
# 
# This isn't the actual line either, but I want to stop here:

So what I ultimately need is:

1. a way to obtain the lines of output.txt that I want (if there is something better than what I'm doing at present);
2. a way to read only the lines from block that are numerical data, in such a way that they can be put into a NumPy array;
3. a way to accomplish 1 & 2 that (if possible) doesn't involve strings.

As a final note, I haven't been using numpy.genfromtxt() because there are also data within this file that are not behind comments (#).

Any recommendations would be appreciated.

Since you're searching for string triggers (start & stop text), I'm confused about how you think you can do this without using strings. — Prune, Mar 04 '16 at 18:36
Also, you may find [this question](http://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-is-a-number-float-in-python) helpful; it gives the canonical method to check whether some input is a float value. — Prune, Mar 04 '16 at 18:38
`numpy.genfromtxt()` takes an iterable of strings as input. You can extract the lines you are interested in to a list, say `data_list`, and then use `numpy.genfromtxt(iter(data_list))` to have this parsed into a Numpy array. — Sven Marnach, Mar 04 '16 at 18:41
@Prune, I'm not sure that I *can* do this without strings. But I'd like to, if possible. I'm going to be doing this procedure on tens of thousands of files, and I'm trying to cut back on any unnecessary steps. However, I'm not a Python expert (hence my being here, asking a question), so it's possible that part of my request is rooted in ignorance, and therefore unreasonable. — Palmetto_Girl86, Mar 04 '16 at 18:42
If I understand correctly, you are doing `block += line` to "get rid of strings". So you generate a string per line and then concatenate them into a single string. This is disastrously non-scalable. You'll spend a crazy amount of time expanding this string bit by bit. — tdelaney, Mar 04 '16 at 18:55
@SvenMarnach, that sounds like a really appealing option. Would I need to strip off any unnecessary lines before doing this? Or will this be able to bypass any strings that are purely text, and select only numerical values? — Palmetto_Girl86, Mar 04 '16 at 18:58
@Palmetto_Girl86 There are lots of options to `genfromtxt()`, e.g. `skip_header` and `skip_footer` to specify how many lines to skip in the beginning and end. See the [documentation](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html) for full details. — Sven Marnach, Mar 04 '16 at 19:02

score 1 · Answer 1 · answered Mar 04 '16 at 18:58

Breaking the block into strings isn't a big deal. In fact, when you read the file line by line to find your start / end conditions, that's exactly what you did. The problem when you are reading a large file is pulling the entire thing into memory before doing the processing.

numpy.genfromtxt() can process a generator, and since it loads the target data line by line, it's much more efficient than pre-reading everything. Here's a generator that discards lines until it finds the ones you want and then feeds them to numpy. It's written for Python 3 but should also work for 2.

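import numpy

def block_reader(fp):
    # Skip everything up to and including the start marker.
    for line in fp:
        if line.strip() == b"# This isn't the actual line, but I'm making a working example:":
            break
    # Yield the data lines until the stop marker is reached.
    for line in fp:
        if line.strip() == b"# This isn't the actual line either, but I want to stop here:":
            break
        line = line[2:].strip()  # drop the leading "# "
        if line:  # ignore lines that held only "#"
            yield line

# skip_header=1 skips the column-title line ("point  c[0] ...").
with open('somefile.txt', 'rb') as fp:
    a = numpy.genfromtxt(block_reader(fp), skip_header=1)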
print(a)
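
To try the same idea without a file on disk, here is a minimal sketch that runs the generator approach on an in-memory copy of the sample from the question. The names `iter_block`, `START`, `STOP` and `SAMPLE` are made up for this illustration, the lines before and after the block are invented filler, and it works in text mode rather than bytes, so treat it as a variation on the answer rather than the exact code above.

```python
import io
import numpy as np

START = "# This isn't the actual line, but I'm making a working example:"
STOP = "# This isn't the actual line either, but I want to stop here:"

# In-memory stand-in for output.txt; the first and last lines are invented filler.
SAMPLE = """\
1.0 2.0 3.0
# This isn't the actual line, but I'm making a working example:
#
#    point     c[0]        c[1]        c[2]
# -0.473359  7161.325229    -609.475403  49128.219132
# -0.459864  7162.047233    -102.060363  1189.270542
#
# This isn't the actual line either, but I want to stop here:
4.0 5.0 6.0
"""

def iter_block(lines, start=START, stop=STOP):
    """Yield the non-empty lines between the markers, minus the leading '#'."""
    it = iter(lines)
    # Consume lines up to and including the start marker.
    for line in it:
        if line.strip() == start:
            break
    # Hand out the remaining lines until the stop marker appears.
    for line in it:
        if line.strip() == stop:
            break
        body = line.lstrip("#").strip()
        if body:
            yield body

# skip_header=1 drops the column-title line before parsing the numbers.
data = np.genfromtxt(iter_block(io.StringIO(SAMPLE)), skip_header=1)
print(data.shape)
```

Because `iter_block` accepts any iterable of lines, the same function should work on an open file object; only the lines inside the block are ever passed to `genfromtxt`.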

score 0 · Answer 2 · answered Mar 04 '16 at 19:46

Following what you already did, you can modify your code as follows to get what you want. Turn it into a generator function that, between the start marker and the end marker, yields every line that contains only numbers, possibly with a '#' sign at the beginning of the line. To do that, I define two helper functions: one that recognizes a number and one that checks whether a line contains only numbers. Feed np.genfromtxt with the output of the function, see below.

import numpy as np

# True if the token looks like a plain decimal number, optionally signed.
is_number = lambda x: x.strip('-+').replace('.', '', 1).isdigit()
# True if the line is non-empty and every whitespace-separated token is a number.
all_number = lambda x: bool(x.split()) and all(is_number(var) for var in x.split())

def read_bloc(fileName):
    # Open in text mode so the lines compare equal to the str markers below.
    with open(fileName, "r") as data_file:
        found = False
        for line in data_file:
            if found:
                if line.strip() == "# This isn't the actual line either, but I want to stop here:":
                    break
                cleaned = line.strip().strip('#')
                if all_number(cleaned):
                    yield cleaned
            else:
                if line.strip() == "# This isn't the actual line, but I'm making a working example:":
                    found = True

print(np.genfromtxt(read_bloc("output.txt")))

One limitation is that, for signed numbers, there must be no space between the sign and the number itself.
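
As a possible alternative to the string-based `is_number` helper, the check can be delegated to `float()`, which is the approach discussed in the question linked in the comments. The sketch below is only an illustration: `all_numbers` is a hypothetical name, and the sample lines are taken from or modelled on the question. It also accepts exponent notation, but it still rejects a sign separated from its number by a space.

```python
def is_number(token):
    """Return True if float() can parse the token."""
    try:
        float(token)
    except ValueError:
        return False
    return True

def all_numbers(line):
    """True for a non-empty line whose whitespace-separated fields are all numeric."""
    fields = line.split()
    return bool(fields) and all(is_number(field) for field in fields)

print(all_numbers("-0.385952  7160.708981     0.675271    2.177786"))
print(all_numbers("point     c[0]        c[1]        c[2]"))
print(all_numbers("1e-3 +2.5"))
print(all_numbers("- 0.5"))
```

Note that `float()` also parses strings such as "nan" and "inf", so a text line consisting only of those words would pass this check.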
