Proper way to read position based text file

Question

So I have a file with data in this (standardized) format:

 12455WE READ THIS             TOO796445 125997  554777     
 22455 888AND THIS       TOO796445 125997  55477778 2 1

Probably thought up by someone who has done too much COBOL.

Each field has a fixed length, so I can read it by slicing the line.

My problem is: how can I structure my code so that it is more flexible and does not rely on hard-coded offsets for the slices? Should I use a class of constants or something like that?

EDIT:

Also, the first digit (0 to 9, always present) determines the structure of the line, which is of fixed length. The file is provided by a third party who guarantees its validity, so I don't need to check the format, only read it. There are around 11 different line structures.

Are there 10 different possible line structures, depending on the first digit? Do the different line structures vary radically from one another? Or do only the first couple of fields vary, with all the subsequent fields remaining unchanged, as shown in your examples? – PM 2Ring, Nov 16 '15 at 09:52
They basically all change; see https://www.febelfin.be/sites/default/files/Standard-CODA-2.3-EN.pdf for the format. – maazza, Nov 16 '15 at 09:57
I see only one way, as suggested below: creating multiple record width structures, based on the initial 5 digit code of the record. Bear in mind that the header record contains a version number, so you may have to allow for different structures based on the version of the data, or you may find that you are unable to decode an older file when the version changes. I was going to ask how difficult it would be to ask for a field separator, but asking the banks to change anything is a) pointless and b) even if they agreed, would take years. – Rolf of Saxony, Nov 16 '15 at 10:10
This is starting to look complicated! I only had a brief look at that PDF, but it appears to me that the line type is determined by that leading 5 digit field, not just its first digit. BTW, questions on SO should be self-contained and not require references to external documents for important information (although links are welcome if they enhance the question). – PM 2Ring, Nov 16 '15 at 10:12
We certainly don't want the whole PDF in your question. But it would be good if your question mentioned roughly how many different line structures your program needs to handle. A solution that's effective for 10 line structures may not be suitable if there are many thousands of potential line structures, and vice versa. – PM 2Ring, Nov 16 '15 at 10:22

PM 2Ring · Answer 1 · 2015-11-16T13:18:09.967

My suggestion is to use a dictionary keyed on the 5 digit line type code. Each value in the dictionary can be a list of field offsets (or of (offset, width) tuples), indexed by field position.
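As a minimal sketch of that idea, here is a plain dict of (offset, width) lists. The name `LAYOUTS`, the helper `split_record`, and the field layouts are all made up to fit the two sample lines in the question; they are not taken from the CODA spec.

```python
# Hypothetical layouts: each 5-digit record code maps to a list of
# (offset, width) tuples, indexed by field position.
LAYOUTS = {
    "12455": [(0, 5), (5, 28), (33, 6), (40, 6), (48, 6)],
    "22455": [(0, 5), (6, 3), (9, 18), (27, 6), (34, 6), (42, 8), (51, 3)],
}


def split_record(line):
    """Slice a line into fields using the layout chosen by its 5-digit code."""
    layout = LAYOUTS[line[:5]]
    return [line[start:start + width] for start, width in layout]


line = "12455WE READ THIS             TOO796445 125997  554777"
fields = split_record(line)
print(fields)
```

Adding a new record type then only means adding one entry to the dict; the slicing code stays the same.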

If your fields have names it may be convenient to use a class instead of a list to store field offset data. However, namedtuples may be better here, since then you can access your field offset data either via its name or by its field position, so you get the best of both worlds.

namedtuples are actually implemented as classes, but defining a new namedtuple type is much more compact than creating an explicit class definition, and namedtuples use the __slots__ protocol, so they take up less RAM than a normal class that uses __dict__ for storing its attributes.
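To illustrate the two access styles and the `__slots__` point, here is a small sketch. The `Layout` type and its two fields are hypothetical, chosen only for the demonstration.

```python
from collections import namedtuple

# A hypothetical two-field layout, used only to show the access styles.
Layout = namedtuple("Layout", ["record_type", "amount"])
layout = Layout(record_type=(0, 5), amount=(48, 6))

# The same (offset, width) pair is reachable by name or by position.
by_name = layout.amount
by_index = layout[1]

# namedtuple classes declare an empty __slots__, so no per-instance
# attribute dictionary is allocated for the fields.
print(by_name, by_index, Layout.__slots__)
```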

Here's one way to use namedtuples to store field offset data. I'm not claiming that the following code is the best way to do this, but it should give you some ideas.

from collections import namedtuple

#Create a namedtuple, `Fields`, containing all field names
fieldnames = [
    'record_type', 
    'special',
    'communication',
    'id_number',
    'transaction_code',
    'amount',
    'other',
]

Fields = namedtuple('Fields', fieldnames)

#Some fake test data
data = [
    #          1         2         3         4         5
    #012345678901234567890123456789012345678901234567890123
    "12455WE READ THIS             TOO796445 125997  554777",
    "22455 888AND THIS       TOO796445 125997  55477778 2 1",
]

#A dict to store the field (offset, width) data for each field in a record,
#keyed by record type, which is always stored at (0, 5)
offsets = {}

#Some fake record structures
offsets['12455'] = Fields(
    record_type=(0, 5), 
    special=None,
    communication=(5, 28),
    id_number=(33, 6),
    transaction_code=(40, 6),
    amount=(48, 6),
    other=None)

offsets['22455'] = Fields( 
    record_type=(0, 5),
    special=(6, 3),
    communication=(9, 18),
    id_number=(27, 6),
    transaction_code=(34, 6),
    amount=(42, 8),
    other=(51,3))

#Test.
for row in data:
    print(row)
    #Get record type
    rt = row[:5]
    #Get field structure
    fields = offsets[rt]
    for name in fieldnames:
        #Get field offset data by field name
        t = getattr(fields, name)
        if t is not None:
            start, flen = t
            stop = start + flen
            #Use a separate name so the `data` list is not overwritten
            value = row[start:stop]
            print("%-16s ... %r" % (name, value))
    print("")

output

12455WE READ THIS             TOO796445 125997  554777
record_type      ... '12455'
communication    ... 'WE READ THIS             TOO'
id_number        ... '796445'
transaction_code ... '125997'
amount           ... '554777'

22455 888AND THIS       TOO796445 125997  55477778 2 1
record_type      ... '22455'
special          ... '888'
communication    ... 'AND THIS       TOO'
id_number        ... '796445'
transaction_code ... '125997'
amount           ... '55477778'
other            ... '2 1'

@maazza: There are several examples of creating and using namedtuples in the docs I linked. — PM 2Ring, Nov 16 '15 at 11:01
@maazza: Give me some field names and I'll add some example code to my answer. If you don't want / need named fields then using classes or namedtuples is unnecessary, you might as well just use a list. — PM 2Ring, Nov 16 '15 at 11:36
@maazza: Ok, I've added some example code. I hope you find it helpful. — PM 2Ring, Nov 16 '15 at 13:18
@PM2Ring Classy answer. If this is used it should be borne in mind that there is a version number in the Header record. This should perhaps be catered for, as mentioned in my previous comment. Should the structures change, the live code may have to handle data using a previous version structure or there will have to be a different program to handle each version structure. — Rolf of Saxony, Nov 16 '15 at 18:36
Thanks, @RolfofSaxony. It certainly would be important in a real program to handle the version number. But this was just supposed to be a quick demo and I didn't want to spend time wading through the details in that PDF of CODA specs, so I just used the field names that maazza supplied, plus a couple of extras I made up to cover the data lines given in the question. — PM 2Ring, Nov 16 '15 at 19:15
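One possible way to cater for the version number raised in these comments is to key the layouts on a (version, record type) pair. This is only a sketch: the version strings, the offsets, and the names `LAYOUTS` and `get_layout` are placeholders, not values from the CODA specification.

```python
# Hypothetical layouts keyed by (version, record type). The version strings
# and offsets are placeholders, not values from the CODA specification.
LAYOUTS = {
    ("1", "12455"): {"record_type": (0, 5), "amount": (48, 6)},
    ("2", "12455"): {"record_type": (0, 5), "amount": (46, 8)},
}


def get_layout(version, record_type):
    """Return the field layout for a version/record pair, or fail loudly."""
    try:
        return LAYOUTS[(version, record_type)]
    except KeyError:
        raise ValueError(
            "no layout for version %r, record type %r" % (version, record_type)
        )


print(get_layout("1", "12455")["amount"])
```

The version would be read once from the header record and then passed along for every following line, so an older file can still be decoded after the structures change.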

score 1 · Accepted Answer · answered Nov 16 '15 at 09:30

Create a list of widths and a routine that accepts this and an indexed column number as parameters. The routine can calculate the start offset for your slice by adding all previous column widths, and add the width of the indexed column for the end offset.
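A rough sketch of such a routine follows. The function names and the width list are assumptions made for illustration; the widths are chosen to fit the first sample line in the question.

```python
def column_slice(widths, index):
    """Return the (start, end) offsets of column `index` for a width list."""
    start = sum(widths[:index])   # total width of all previous columns
    end = start + widths[index]   # add the width of the indexed column
    return start, end


def read_column(line, widths, index):
    """Return the text of column `index`, padding included."""
    start, end = column_slice(widths, index)
    return line[start:end]


# Hypothetical widths for one record type; a different record type would
# simply use a different list with the same routine.
widths = [5, 28, 7, 8, 6]
line = "12455WE READ THIS             TOO796445 125997  554777"
print(read_column(line, widths, 0))
print(column_slice(widths, 2))
```

Because the widths are contiguous, any padding between fields is counted as part of a column, so values may need a `.strip()` after slicing.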

answered Nov 16 '15 at 09:30

Jongware


Do you think it still applies if the first number determines the structure of the line? – maazza Nov 16 '15 at 09:39
@maazza: I don't see a problem in having different width lists. That's why I suggested a generalized routine. Of course it's entirely up to you how you select which of several width lists to use, but the routine stays the same. – Jongware Nov 16 '15 at 09:49

score 1 · Answer 3 · edited May 23 '17 at 11:48

You can have a list of widths of the columns describing the format and unfold it like this:

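#Column widths for each record type, indexed by the leading digit
formats = [
    [1, ],
    [1, 4, 28, 7, 7, 7],
]

def unfold(line):
    #Pick the width list from the first character of the line
    lengths = formats[int(line[0])]
    #Running totals of the widths give the end offset of each column
    ends = [sum(lengths[0:n+1]) for n in range(len(lengths))]
    #Pair each start offset with its end offset and slice
    return [line[s:e] for s,e in zip([0] + ends[:-1], ends)]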

lines = [
    "12455WE READ THIS             TOO796445 125997 554777",
]

for line in lines:
    print(unfold(line))

Edit: Updated the code to better match what maazza asked in the edited question. This assumes the format designator is a single leading digit, but it can easily be generalized to other format designators.
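One way that generalization might look, assuming the designator is the leading 5-character code rather than a single digit: swap the list for a dict keyed by that code. `FORMATS`, `KEY_WIDTH`, and the widths below are illustrative names and values, not part of the original answer or the CODA spec.

```python
# Hypothetical layouts keyed by the leading 5-character record code instead
# of a single digit; the widths are invented to fit the sample line.
FORMATS = {
    "12455": [5, 28, 7, 7, 7],
}
KEY_WIDTH = 5


def unfold(line):
    """Split a line into columns using the widths for its record code."""
    lengths = FORMATS[line[:KEY_WIDTH]]
    fields = []
    start = 0
    for length in lengths:
        fields.append(line[start:start + length])
        start += length  # the next column begins where this one ended
    return fields


print(unfold("12455WE READ THIS             TOO796445 125997 554777"))
```

A dict lookup also means the codes do not have to be numeric or consecutive.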
