
I've been searching for a solution to this question for a while without any luck. I want to use Python to read a text file and create some lists (or arrays) based on the data in the file. An example will best illustrate my goal.

Consider the following text:

NODE
1.0, 2.0
2.0, 2.0
3.0, 2.0
4.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
1, 2, 3, 4
1, 2, 3, 4
1, 2, 3, 4
5, 6, 7, 8
5, 6, 7, 8
5, 6, 7, 8

I would like to read through the file (ideally in a single pass, since the files can be large), and once I find "NODE", take each line between "NODE" and "ELEMENT" and put it into a list. Then, once I reach "ELEMENT", take each line between "ELEMENT" and some other break (maybe another "ELEMENT", the end of the file, etc.) and put that into a second list. For this example, it would result in two lists.
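
To make the goal concrete, the two lists I'm after for this example would look roughly like this (hand-written just to show the shape I want; the names are only for illustration):

node_list = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]]
element_list = [[1, 2, 3, 4], [5, 6, 7, 8], [1, 2, 3, 4], [1, 2, 3, 4],
                [1, 2, 3, 4], [5, 6, 7, 8], [5, 6, 7, 8], [5, 6, 7, 8]]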

I've tried various things but they all require knowing information about the file beforehand. I'd like to be able to automate it. Thank you very much!

Ryan James
    What have you done so far? Please post the code you have written. – sampathsris Aug 15 '14 at 05:07
  • If you don't want to require any information about the file beforehand, what's the rule that tells you that you've hit a new section? – abarnert Aug 15 '14 at 05:19
  • @abarnert I misspoke in my initial post, I know what sections I'm looking for (i.e. NODE or ELEMENT), just not the number of lines between each section. – Ryan James Aug 15 '14 at 19:40
  • Thank you all for the different options. Dawg's solution looks like it will be most likely to do what I need to do in the big picture. – Ryan James Aug 15 '14 at 19:51

4 Answers


With that example data, and assuming that the labels follow the pattern in your example, you can use a regex:

import re, mmap, os

def conv(s):
    # Convert numeric fields to float; leave anything else alone
    try:
        return float(s)
    except ValueError:
        return s

fn = 'somefile.txt'   # path to your data file

data_dict = {}
with open(fn, 'r') as fin:
    size = os.stat(fn).st_size
    # Memory-map the file so the regex can scan it without slurping it into a string
    data = mmap.mmap(fin.fileno(), size, access=mmap.ACCESS_READ)
    for m in re.finditer(r'^(\w+)$([\d\s,.]+)', data, re.M):
        # group(1) is the section label, group(2) is the block of numbers under it
        data_dict[m.group(1)] = [[conv(e) for e in line.split(',')]
                                 for line in m.group(2).splitlines() if line.strip()]

print data_dict

Prints:

{'NODE': [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]], 
 'ELEMENT': [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0]]}

So, how does this work:

  1. We use mmap to apply a regex to a file
  2. We assume that the labels are of the form ^\w+$ (i.e., a label made up of letters, digits, or underscores alone on a line)
  3. Then we capture all the digits, spaces, commas, and periods that follow
  4. Create a dict with the label as the key and the parsed numbers, as a list of rows of floats, as the value

Done!
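
If you want to poke at the regex by itself before wiring up mmap, here is a minimal sketch that applies the same pattern to the sample text pasted into a string (the string literal is just the example data from the question):

import re

sample = """NODE
1.0, 2.0
2.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
"""

for m in re.finditer(r'^(\w+)$([\d\s,.]+)', sample, re.M):
    # group(1) is the label line, group(2) is everything numeric after it
    print((m.group(1), m.group(2).strip().splitlines()))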

dawg
  • @dawg, I really like this solution as I think it will provide future flexibility. That said, I have a lot to learn about mmap and regex. One immediate question regarding regex. I also have section headers that look like "*Section, Name = Section1". Mind providing more information on how to handle this with regex? Thanks! – Ryan James Aug 15 '14 at 19:55
  • For that exact example, use `^(\w+,\s+\w+\s*=\s*\w+)$` for the section part of the regex (not tested...) – dawg Aug 15 '14 at 20:08
  • Thanks again @dawg! One last question, I promise (this is getting too far from the original question). I can't seem to get your example regex to work. I think it has to do with the asterisk at the beginning of the string "*Section". How do I capture that asterisk? Thanks again! – Ryan James Aug 15 '14 at 20:33
  • Try `^(\*\w+,\s+\w+\s*=\s*\w+)$` – dawg Aug 15 '14 at 20:38
  • The only problem with this solution (which is often not a problem at all, as long as you make sure it isn't relevant) is that `mmap` can't handle huge files on 32-bit platforms. It doesn't have to read the whole file into memory, but it _does_ have to allocate page space for the whole file, and in 32-bit-land, there's only 2-4GB of page space. – abarnert Aug 18 '14 at 02:13

If you want this to be fully general and automated, you need to come up with the rule that distinguishes section headers from rows. I'll invent one, but it's probably not the one you want, in which case my invented code won't work for you… but hopefully it will show you what you need to do, and how to get started.

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

Now we can group the rows by whether or not they're section headers, using itertools.groupby. If you printed out each group, you'd get something like this:

True, [['NODE']]
False, [['1.0', '2.0'], ['2.0', '2.0'], …, ]
True, [['ELEMENT']]
False, [['1.0', '2.0', '3.0', '4.0'], …, ]
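
(If you want to see those groups for yourself, a quick sketch along these lines, reusing new_section from above, will print them; path stands in for your filename:)

import csv, itertools

with open(path) as f:
    for key, group in itertools.groupby(csv.reader(f), new_section):
        print((key, list(group)))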

We don't care about the first value in each of those, so drop it.

And we want to batch up each pair of adjacent groups into a (header, rows) pair, which we can do by zipping our iterator with itself.
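
(If zipping an iterator with itself is unfamiliar, here is a tiny throwaway illustration with made-up data:)

it = iter(['header-1', 'data-1', 'header-2', 'data-2'])
print(list(zip(it, it)))   # [('header-1', 'data-1'), ('header-2', 'data-2')]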

And then just put it in a dict, which will look something like this:

{'NODE': [['1.0', '2.0'], ['2.0', '2.0'], …],
 'ELEMENT': [['1.0', '2.0', '3.0', '4.0'], …]}

Here's the whole thing:

import csv
import itertools

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

with open(path) as f:
    rows = csv.reader(f)
    # Group consecutive rows by whether they are section headers
    grouped = itertools.groupby(rows, new_section)
    # Materialize each group so it survives once groupby moves on to the next one
    groups = (list(group) for key, group in grouped)
    # Pair each header group with the data group that follows it
    pairs = zip(groups, groups)
    lists = {header[0][0]: rows for header, rows in pairs}
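
Note that everything is still a string at this point; for the sample file, lists['NODE'][0] would be ['1.0', ' 2.0'], with the space csv keeps after each comma. If you want floats, one way (a small follow-up sketch, not part of the recipe above) is:

lists = {k: [[float(x) for x in row] for row in v] for k, v in lists.items()}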
abarnert
import re

def getBlocks(fname):
    state = 0
    node = []
    ele = []
    with open(fname) as f:
        for line in f:
            if "NODE" in line:
                # A new NODE section; emit the previous block first, if there is one
                if state == 2:
                    yield (node, ele)
                    node, ele = [], []
                state = 1
            elif state == 1 and "ELEMENT" in line:
                state = 2
            elif state == 1:
                node.append(list(map(float, line.split(","))))
            elif state == 2 and re.match("[a-zA-Z]+", line):
                # Some other header ends the ELEMENT block
                yield (node, ele)
                node, ele = [], []
                state = 0
            elif state == 2:
                ele.append(list(map(int, line.split(","))))
    # Emit whatever is left at end of file
    yield (node, ele)

for node,ele in getBlocks("somefile.txt"):
    print "N:",node
    print "E:",ele

This might be about what you're looking for. It's kinda gross... I'm sure you can do it better.

Joran Beasley

For the simpler problem in the updated question, you really don't need regexps, or groupby, or a complex state machine, or anything beyond what a novice should be able to understand easily.

All you need to do is accumulate rows into one list until you find the row 'ELEMENT', then start accumulating rows into the other one. Like this:

import csv
result = {'NODES': [], 'ELEMENTS': []}
current = result['NODES']
with open(path) as f:
    for row in csv.reader(f):
        if row == ['NODE']:
            pass
        elif row == ['ELEMENT']:
            current = result['ELEMENTS']
        else:
            current.append(row)
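
As written, everything ends up as strings, and csv keeps the space after each comma (so a row looks like ['1.0', ' 2.0']). If you'd rather have numbers, here is a small variation of the same loop; it's a sketch assuming the comma-space format from the question, with path standing in for your filename:

import csv

result = {'NODES': [], 'ELEMENTS': []}
current = result['NODES']
with open(path) as f:
    # skipinitialspace drops the blank after each comma; float() converts the fields
    for row in csv.reader(f, skipinitialspace=True):
        if row == ['NODE']:
            pass
        elif row == ['ELEMENT']:
            current = result['ELEMENTS']
        else:
            current.append([float(x) for x in row])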
abarnert