Tokenize string with field of specific length in pyparsing

Question

I'm writing a simple parser for ASCII data in which each row must be interpreted as a sequence of fixed-width fields, 8 characters each:

"""
|--1---||--2---||--3---||--4---||--5---||--6---||--7---||--8---||--9---|
GRID         119           18.27  562.33  528.87
"""

This row should be interpreted as:

1: GRID + 4 blank spaces
2: 5 blank spaces + 119
3: 8 blank spaces
4: 3 blank spaces + 18.27
5: 2 blank spaces + 562.33
6: 2 blank spaces + 528.87
7: 8 blank spaces
8: 8 blank spaces
9: 8 blank spaces

This is what I've tried:

EOL = LineEnd().suppress()
card_keyword = Keyword("GRID").leaveWhitespace().suppress()
number_card_fields = (number + ZeroOrMore(White()))
empty_card_fields = 8 * White()
card_fields = (number_card_fields | empty_card_fields)
card = (card_keyword + OneOrMore(card_fields)).setParseAction(self._card_to_dict)


def _card_to_dict(self, toks):
    _FIELDS_MAPPING = {
        0: "id", 1: "cp", 2: "x1", 3: "x2", 4: "x3", 5: "cd", 6: "ps", 7: "seid"
    }
    mapped_card = {self._FIELDS_MAPPING[idx]: token_field for idx, token_field in enumerate(toks)}
    return mapped_card

test2 = """
GRID         119           18.27  562.33  528.87                        
"""
print(card.searchString(test2))

This returns:

[[{'id': 119, 'cp': '           ', 'x1': 18.27, 'x2': '  ', 'x3': 562.33, 'cd': '  ', 'ps': 528.87, 'seid': '                        \n'}]]

I would like to obtain this instead:

[[{'id': 119, 'cp': '        ', 'x1': 18.27, 'x2': 562.33, 'x3': 528.87, 'cd': '        ', 'ps': '        ', 'seid': '        '}]]

I think the problem is in number_card_fields = (number + ZeroOrMore(White())). I don't know how to tell pyparsing that this expression must be exactly 8 characters long.

Can someone help me? Thanks in advance for your valuable support.

Do you *absolutely* have to use pyparsing? If you have fixed-size data, you may be better off using slices to pull each field out. See this answer: https://stackoverflow.com/questions/3911483/python-slice-how-to-i-know-the-python-slice-but-how-can-i-use-built-in-slice-ob/3911763#3911763 — PaulMcG, Jul 31 '19 at 22:36
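
The slicing approach suggested in the comment above could look roughly like the following. This is a minimal sketch, not code from the linked answer: the helper name split_fixed_width, the field width of 8, and the count of 9 fields per row are assumptions taken from the layout shown in the question.

```python
FIELD_WIDTH = 8   # assumed from the question: every field is 8 characters wide
FIELD_COUNT = 9   # assumed from the question: keyword plus 8 data fields


def split_fixed_width(line, width=FIELD_WIDTH, count=FIELD_COUNT):
    """Cut `line` into `count` fields of `width` characters each.

    Short lines are padded with spaces so trailing blank fields are kept.
    """
    padded = line.ljust(width * count)
    return [padded[i * width:(i + 1) * width] for i in range(count)]


# The sample row from the question, written field by field (8 characters each).
line = (
    "GRID    "
    "     119"
    "        "
    "   18.27"
    "  562.33"
    "  528.87"
)

fields = split_fixed_width(line)
stripped = [f.strip() for f in fields]
print(stripped)
```

Every element of fields is exactly 8 characters long, so blank fields survive as strings of spaces until they are stripped.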

PaulMcG · Answer 1 · 2022-07-09T19:37:56.167

Pyparsing allows you to specify words of an exact length. Since your lines consist of fixed-size fields, your "words" are made up of any printable or space character, with an exact size of 8:

field = Word(printables + " ", exact=8)

Here is a parser for your input line:

import pyparsing as pp
# clear out whitespace characters - pretty much disables whitespace skipping
pp.ParserElement.setDefaultWhitespaceChars('')

# define an expression that matches exactly 8 printable or space characters
field = pp.Word(pp.printables + " ", exact=8).setName('field')

# a line has one or more fields
parser = field[1, ...]

# try it out
line = "GRID         119           18.27  562.33  528.87"

print(parser.parseString(line).asList())

Prints:

['GRID    ', '     119', '        ', '   18.27', '  562.33', '  528.87']

I find those spaces annoying, so we can add a parse action to field to strip them:

# add a parse action to field to strip leading and trailing spaces
field.addParseAction(lambda t: t[0].strip())
print(parser.parseString(line).asList())

Now gives:

['GRID', '119', '', '18.27', '562.33', '528.87']

It looks like you expect a total of 8 fields, and you want to convert the numeric fields to float. Here is a modified version of your _card_to_dict parse action:

def str_to_value(s):
    if not s:
        return None
    try:
        return float(s)
    except ValueError:
        return s

def _card_to_dict(toks):
    _FIELDS_MAPPING = {
        0: "id", 1: "cp", 2: "x1", 3: "x2", 4: "x3", 5: "cd", 6: "ps", 7: "seid"
    }
    
    # this is one way to do it, but you can just add the names to toks
    # mapped_card = {self._FIELDS_MAPPING[idx]: token_field for idx, token_field in enumerate(toks)}
    for idx, token_field in enumerate(toks):
        toks[_FIELDS_MAPPING[idx]] = str_to_value(token_field)

parser.addParseAction(_card_to_dict)
result = parser.parseString(line)

You can convert this result to a dict:

print(result.asDict())

prints:

{'cd': 528.87, 'x2': 18.27, 'id': 'GRID', 'cp': 119.0, 'x1': None, 'x3': 562.33}

If you dump the results using:

print(result.dump())

you'll get:

['GRID', '119', '', '18.27', '562.33', '528.87']
- cd: 528.87
- cp: 119.0
- id: 'GRID'
- x1: None
- x2: 18.27
- x3: 562.33

This shows how you can access the parsed result directly, without having to convert to a dict:

print(result['x2'])
print(result.id)

prints

18.27
GRID
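
Note that the mapping above assigns the keyword GRID to "id", whereas the question expects "id" to be 119, and that trailing blank fields are not tokenized when the line is shorter than 72 characters. One possible variation is sketched here. It assumes that the first field is always the card keyword and that a row holds 9 fields in total; the names parse_card and FIELD_NAMES are hypothetical and only serve the illustration.

```python
import pyparsing as pp

# As in the answer: disable whitespace skipping so spaces count as field content.
pp.ParserElement.setDefaultWhitespaceChars("")

FIELD_WIDTH = 8
# Names of the 8 fields after the keyword, taken from the question's mapping.
FIELD_NAMES = ["id", "cp", "x1", "x2", "x3", "cd", "ps", "seid"]


def str_to_value(s):
    """Return None for a blank field, a float if possible, else the string."""
    if not s:
        return None
    try:
        return float(s)
    except ValueError:
        return s


# Exactly 8 printable-or-space characters, stripped of surrounding spaces.
field = pp.Word(pp.printables + " ", exact=FIELD_WIDTH).setName("field")
field.addParseAction(lambda t: t[0].strip())
parser = field[1, ...]


def parse_card(line):
    """Parse one row into (keyword, dict of named field values)."""
    # Pad short rows so that trailing blank fields are still tokenized.
    padded = line.ljust(FIELD_WIDTH * (len(FIELD_NAMES) + 1))
    tokens = parser.parseString(padded).asList()
    keyword, values = tokens[0], tokens[1:]
    return keyword, {name: str_to_value(v) for name, v in zip(FIELD_NAMES, values)}


# The sample row from the question, written field by field (8 characters each).
line = (
    "GRID    "
    "     119"
    "        "
    "   18.27"
    "  562.33"
    "  528.87"
)

keyword, card = parse_card(line)
print(keyword)
print(card)
```

With this arrangement blank fields come back as None rather than as strings of 8 spaces; if the original padding is needed, the strip parse action can be left out.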
