Parsing a series of fixed-width files

Question

I have a series (~30) of files that are made up of rows like:

xxxxnnxxxxxxxnnnnnnnxxxnn

Here x is a character, n is a digit, and each group is a separate field.

The layout is fixed within each file, so it would be pretty easy to split and read with struct or slicing. However, I was wondering if there's an effective way of doing it for a lot of files (each with different fields and lengths) without hard-coding every layout.

One idea I had was creating an XML file with the schema for each file. I could then add new ones as required, and the code would be more portable. However, I wanted to check that there are no simpler or more standard ways of doing this.

I will be outputting the data into either Redis or an ORM if this helps, and each file will only be processed once (although other files with different structures will be added at later dates).

Thanks

Possible duplicate of [_Efficient way of parsing fixed width files in Python_](http://stackoverflow.com/questions/4914008/efficient-way-of-parsing-fixed-width-files-in-python). — martineau, Jan 26 '15 at 13:27
I see why this is flagged as potential duplicate, but the question isn't about parsing 1 file, it's more about parsing a lot with different structure and making a portable solution that isn't hard-coded for each one. — user1185675, Jan 26 '15 at 14:53

score 1 · Answer 1 · answered Jan 26 '15 at 13:28

You could use itertools.groupby with str.isdigit (or str.isalpha) as the key function, which splits a line into alternating runs of digits and non-digits:

>>> import itertools
>>> line = "aaa111bbb22cccc345defgh67"
>>> [''.join(group) for _, group in itertools.groupby(line, str.isdigit)]
['aaa', '111', 'bbb', '22', 'cccc', '345', 'defgh', '67']
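
As a rough sketch of how this could feed a parser, the group lengths can be turned into slice offsets once and then reused for every row. The helper names `infer_slices` and `split_row` are made up for illustration. The approach assumes neighbouring fields always alternate between digits and non-digits; two adjacent numeric fields would be merged into one.

```python
from itertools import groupby

def infer_slices(sample_line):
    """Return one slice per run of digits or non-digits in sample_line."""
    slices = []
    start = 0
    for _, group in groupby(sample_line, str.isdigit):
        width = len(list(group))
        slices.append(slice(start, start + width))
        start += width
    return slices

def split_row(line, slices):
    """Cut a row into fields using the precomputed slices."""
    return [line[s] for s in slices]

# Layout is inferred from one sample row, then applied to another row.
slices = infer_slices("aaa111bbb22cccc345defgh67")
print(split_row("xyz999qrs00tuvw123abcde45", slices))
# ['xyz', '999', 'qrs', '00', 'tuvw', '123', 'abcde', '45']
```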

fredtantini

score 1 · Answer 2 · edited May 23 '17 at 12:00

I think @fredtantini's answer contains a good suggestion. Here's a fleshed-out way of applying it to your problem, combined with a minor variation of the code in my answer to the related question Efficient way of parsing fixed width files in Python:

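from itertools import groupby
from struct import Struct

def parse_fields(filename, encoding='ascii'):
    with open(filename, encoding=encoding) as file:
        # determine the layout of fields from the first line of the file
        # (assumes adjacent fields alternate between digits and non-digits)
        firstline = file.readline().rstrip()
        fieldwidths = (len(''.join(group))
                       for _, group in groupby(firstline, str.isdigit))
        fmtstring = ''.join('{}s'.format(fw) for fw in fieldwidths)
        parse = Struct(fmtstring).unpack_from
        file.seek(0)  # rewind
        for line in file:
            # Struct works on bytes in Python 3: encode, unpack, then decode
            fields = parse(line.encode(encoding))
            yield tuple(field.decode(encoding) for field in fields)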

for row in parse_fields('somefile.txt'):
    print(row)
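
If the first line cannot be relied on to reveal the layout (for example, when two adjacent fields are both numeric), the widths could instead be supplied from outside the code, along the lines of the schema idea in the question. The following is only a sketch under assumptions: the JSON layout format, the `customers` file type, its field names, and the `build_parser` helper are all hypothetical, and JSON is used here only because it needs less code than XML.

```python
import json

# Hypothetical layouts: one entry per file type, each a list of
# [field_name, width, kind] where kind is "str" or "int".
# In practice this text would live in a separate file.
LAYOUTS_JSON = """
{
  "customers": [["code", 4, "str"], ["region", 2, "int"], ["name", 7, "str"]]
}
"""

CONVERTERS = {"str": str.strip, "int": int}

def build_parser(layout):
    """Precompute (name, slice, converter) triples and return a row parser."""
    plan = []
    start = 0
    for name, width, kind in layout:
        plan.append((name, slice(start, start + width), CONVERTERS[kind]))
        start += width

    def parse_row(line):
        # Slice each field out of the row and convert it to its declared type.
        return {name: convert(line[span]) for name, span, convert in plan}

    return parse_row

layouts = json.loads(LAYOUTS_JSON)
parse_customer = build_parser(layouts["customers"])
print(parse_customer("ABCD07Smith  "))
# {'code': 'ABCD', 'region': 7, 'name': 'Smith'}
```

Adding a new file type then means adding one entry to the layout data rather than writing new parsing code, and the resulting dictionaries can be passed on to Redis or an ORM as they are.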
