Update (I got a biiit further...)

So my goal is to write a parser for a script in a weird format that is similar to XML but is not XML.

<[file][][]
<[cultivation][][]
    <[string8][coordinate_system][lonlat]>
    <[list_vegetation_map_exclusion_zone][vegetation_map_exclusion_zone_list][]
    >
    <[string8][buildings_texture_folder][]>
    <[list_plant][plant_list][]
    >
    <[list_building][building_list][]
        <[building][element][0]
            <[vector3_float64][position][7.809637 46.182262 0]>
            <[float32][direction][-1.82264196872711]>
            <[float32][length][25.9434452056885]>
            <[float32][width][17.4678573608398]>
            <[int32][floors][3]>
            <[stringt8c][roof][gable]>
            <[stringt8c][usage][residential]>
        > ...

So far I got this:

import pyparsing as pp
from pyparsing import (Combine, Literal, OneOrMore, Optional, SkipTo,
                       Suppress, Word, alphas, nums)

def toc_parser(file_path):
    # read the complete file into a variable
    with open(file_path, "r") as f:
        toc = f.read()
    parser = OneOrMore(Word(alphas))
    # ignore comments
    parser.ignore('//' + pp.restOfLine)
    # suppress the <> and [] delimiters
    klammern = Suppress("<")
    klammernzu = Suppress(">")
    eckig = Suppress("[")
    eckigzu = Suppress("]")
    element = Suppress("[element]")
    leer = Suppress("[]")

    # grammar:
    nameBuilding = "building"
    namePosition = "position"
    nameDirection = "direction"
    nameLength = "length"
    nameWidth = "width"
    nameFloors = "floors"
    nameRoof = "roof"
    nameUsage = "usage"

    buildingzahl = klammern + eckig + nameBuilding + eckigzu + element +eckig + Word(nums) +eckigzu
    pos = klammern + eckig + SkipTo(Literal("]")) + eckigzu + eckig + namePosition + eckigzu + eckig + Combine(Word(nums)+"."+Word(nums))+ Combine(Word(nums)+"."+Word(nums))+ Word(nums)+ eckigzu + klammernzu
    direc = klammern + eckig + SkipTo(Literal("]")) + eckigzu + eckig + nameDirection + eckigzu + eckig + Combine(Optional("-")+Word(nums)+Optional("."+Word(nums)))+ eckigzu + klammernzu
    leng = klammern + eckig + SkipTo(Literal("]")) + eckigzu + eckig + nameLength + eckigzu+eckig + Combine(Word(nums)+Optional("."+Word(nums)))+ eckigzu + klammernzu
    widt = klammern + eckig + SkipTo(Literal("]")) + eckigzu + eckig + nameWidth + eckigzu+eckig+Combine(Word(nums)+Optional("."+Word(nums)))+ eckigzu + klammernzu
    floors = klammern + eckig + SkipTo(Literal("]")) + eckigzu + eckig + nameFloors + eckigzu+eckig+Word(nums)+ eckigzu + klammernzu
    roof = klammern + eckig + SkipTo(Literal("]")) + eckigzu + eckig + nameRoof + eckigzu +eckig+Word(alphas)+ eckigzu + klammernzu
    usag = klammern + eckig + SkipTo(Literal("]")) + eckigzu + eckig + nameUsage+ eckigzu+eckig+Word(alphas)+ eckigzu + klammernzu

    building = buildingzahl + pos +direc +leng + widt + floors + roof + usag + klammernzu

    file = klammern + eckig + Literal("file") + eckigzu + leer + leer + klammern + eckig+ Literal("cultivation") +eckigzu + leer + leer
    vegexcl = Literal("<[list_vegetation_map_exclusion_zone][vegetation_map_exclusion_zone_list][]") + klammernzu
    coordsis = Literal("<[string8][coordinate_system][lonlat]>")
    textures = Literal("<[string8][buildings_texture_folder][]>")
    listPlants = Literal("<[list_plant][plant_list][]") + klammernzu
    listBuildings = Literal("<[list_building][building_list][]") + OneOrMore(building) + klammernzu
    listLights = Literal("<[list_light][light_list][]") + klammernzu
    listAirportLights = Literal("<[list_airport_light][airport_light_list][]") + klammernzu
    listXref = Literal("<[list_xref][xref_list][]") + klammernzu

    fileganz = file + coordsis + vegexcl + textures + listPlants + listBuildings + listLights + listAirportLights + listXref + klammernzu + klammernzu
    print(fileganz.parseString(toc))

QUESTION:

I need to be able to overwrite certain values in the external script. I figured out (here) that this is roughly how you do it, but it always enters the "else" branch:

# define values to be updated
valuesToUpdate = {
    "building": "home",
    }

def updateSelectedDefinitions(tokens):
    if tokens.name in valuesToUpdate:
        newVal = valuesToUpdate[tokens.name]
        # the original format string was incomplete; this output format is a guess
        return "%s %s" % (tokens.name, newVal)
    else:
        raise ParseException("no update defined")

Thx so much for helping :)

Liri22
  • XML parsers usually parse the generic `<tag>some content</tag>` format without hardcoding the actual tag values. The generic framework of your structure is `<[type][name][value] contents...>`, where the optional contents would be recursive instances of the same `<[type][name] etc.>` format. This should be pretty straightforward to code in pyparsing in just a few lines. Then you would traverse the parsed structure to extract the "building" or "position" or whatever values. You might also consider making your parser convert to JSON or XML, and then use the stdlib to extract your values. – PaulMcG Nov 19 '21 at 07:02
  • @PaulMcG Can you elaborate on how I would go about that? Could you give me an example? – Liri22 Nov 19 '21 at 07:24

1 Answer

Here is a quick run through.

First, we should just try to describe this format in words:

"Each entry is enclosed in '<>' characters, and contains 3 values in '[]' characters, followed by zero or more nested entries. The 3 values in '[]'s contain a data type, an optional name, and an optional value or values. The values could be numbers or strings, and might be parsed as scalar or list values depending on the data type."

Converting this to a quasi-BNF, where '*' is used for "zero or more":

entry ::= '<' subentry subentry subentry entry* '>'
subentry ::= '[' value* ']'
value ::= number | alphanumeric word

We can see that this is a recursive grammar, since entry can contain elements that are also entry. So when we convert to pyparsing, we will define entry as a placeholder using a pyparsing Forward, and then define its structure once all the other expressions are defined.

Converting this short BNF to pyparsing:

import pyparsing as pp

# define some basic punctuation - useful at parse time, but we will
# suppress them since we don't really need them after parsing is done
# (we'll use pyparsing Groups to capture the structure that these 
# characters represent)
LT, GT, LBRACK, RBRACK = map(pp.Suppress, "<>[]")

# define our placeholder for the nested entry
entry = pp.Forward()

# work bottom-up through the BNF
value = pp.pyparsing_common.number | pp.Word(pp.alphas, pp.alphanums+"_")
subentry = pp.Group(LBRACK - value[...] + RBRACK)
type_name_value = subentry*3
entry <<= pp.Group(LT
                   - type_name_value("type_name_value") 
                   + pp.Group(entry[...])("contents") + GT)

At this point, you can use entry to parse your sample text (after adding enough closing '>'s to make it a valid nested expression):

result = entry.parseString(sample)
result.pprint()

Prints:

[[['file'],
  [],
  [],
  [[['cultivation'],
    [],
    [],
    [[['string8'], ['coordinate_system'], ['lonlat'], []],
     [['list_vegetation_map_exclusion_zone'],
      ['vegetation_map_exclusion_zone_list'],
      [],
      []],
     [['string8'], ['buildings_texture_folder'], [], []],
     [['list_plant'], ['plant_list'], [], []],
     [['list_building'],
      ['building_list'],
      [],
      [[['building'],
        ['element'],
        [0],
        [[['vector3_float64'], ['position'], [7.809637, 46.182262, 0], []],
         [['float32'], ['direction'], [-1.82264196872711], []],
         [['float32'], ['length'], [25.9434452056885], []],
         [['float32'], ['width'], [17.4678573608398], []],
         [['int32'], ['floors'], [3], []],
         [['stringt8c'], ['roof'], ['gable'], []],
         [['stringt8c'], ['usage'], ['residential'], []]]]]]]]]]]

So this is a start. We can see that the values are parsed and converted to the proper types: numbers become ints or floats, and everything else stays a string.
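To poke at this raw structure before adding any parse actions, here is a minimal self-contained sketch. It repeats the grammar from above and parses a shortened sample (the `sample` string and the `building` variable are made up for illustration), then reads the pieces back through the `type_name_value` and `contents` results names:

```python
import pyparsing as pp

LT, GT, LBRACK, RBRACK = map(pp.Suppress, "<>[]")
entry = pp.Forward()
value = pp.pyparsing_common.number | pp.Word(pp.alphas, pp.alphanums + "_")
subentry = pp.Group(LBRACK - value[...] + RBRACK)
type_name_value = subentry * 3
entry <<= pp.Group(LT
                   - type_name_value("type_name_value")
                   + pp.Group(entry[...])("contents") + GT)

# shortened, properly closed sample (illustration only)
sample = """
<[building][element][0]
    <[vector3_float64][position][7.809637 46.182262 0]>
    <[int32][floors][3]>
    <[stringt8c][roof][gable]>
>
"""

# entry is wrapped in a Group, so the parsed entry is the first token
building = entry.parseString(sample, parseAll=True)[0]

# each of the three bracketed parts is its own (possibly empty) group
data_type, name, value_tokens = building.type_name_value
print(data_type[0], name[0], value_tokens[0])

# nested entries are collected under the "contents" results name
for child in building.contents:
    child_type, child_name, child_value = child.type_name_value
    print(child_name[0], list(child_value))
```

Passing `parseAll=True` makes pyparsing complain if anything is left over after the outermost closing '>', which is a quick way to catch unbalanced input.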

To convert these pieces into a more coherent structure, we can attach a parse action to entry, which will be a parse-time callback as each entry gets parsed.

In this case, we will write a parse action that will process the type/name/value triple, and then capture the nested contents if present. We'll try to infer from the data type string how to structure the value or contents.

def convert_entry_to_dict(tokens):
    # entry is wrapped in a Group, so ungroup to get the parsed elements
    parsed = tokens[0]

    # unpack data type, optional name and optional value
    data_type, name, value = parsed.type_name_value
    data_type = data_type[0] if data_type else None
    name = name[0] if name else None

    # save type and name in dict to be returned from the parse action
    ret = {'type': data_type, 'name': name}

    # if there were contents present, save them as the value; otherwise,
    # get the value from the third element in the triple (use the
    # parsed data type as a hint as to whether the value should be a 
    # scalar, a list, or a str)
    if parsed.contents:
        ret["value"] = list(parsed.contents)
    else:
        if data_type.startswith(("vector", "list")):
            ret["value"] = [*value]
        else:
            ret["value"] = value[0] if value else None
            if ret["value"] is None and data_type.startswith("string"):
                ret["value"] = ""

    return ret

entry.addParseAction(convert_entry_to_dict)

Now when we parse the sample, we get this structure:

[{'name': None,
  'type': 'file',
  'value': [{'name': None,
             'type': 'cultivation',
             'value': [{'name': 'coordinate_system',
                        'type': 'string8',
                        'value': 'lonlat'},
                       {'name': 'vegetation_map_exclusion_zone_list',
                        'type': 'list_vegetation_map_exclusion_zone',
                        'value': []},
                       {'name': 'buildings_texture_folder',
                        'type': 'string8',
                        'value': ''},
                       {'name': 'plant_list',
                        'type': 'list_plant',
                        'value': []},
                       {'name': 'building_list',
                        'type': 'list_building',
                        'value': [{'name': 'element',
                                   'type': 'building',
                                   'value': [{'name': 'position',
                                              'type': 'vector3_float64',
                                              'value': [7.809637,
                                                        46.182262,
                                                        0]},
                                             {'name': 'direction',
                                              'type': 'float32',
                                              'value': -1.82264196872711},
                                             {'name': 'length',
                                              'type': 'float32',
                                              'value': 25.9434452056885},
                                             {'name': 'width',
                                              'type': 'float32',
                                              'value': 17.4678573608398},
                                             {'name': 'floors',
                                              'type': 'int32',
                                              'value': 3},
                                             {'name': 'roof',
                                              'type': 'stringt8c',
                                              'value': 'gable'},
                                             {'name': 'usage',
                                              'type': 'stringt8c',
                                              'value': 'residential'}]}]}]}]}]

If you need to rename any field names, you can add that behavior in the parse action.
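As a rough sketch of what renaming could look like, the parse action below looks each name up in a mapping before storing it. The `FIELD_RENAMES` dict, its entries, and the `entry_to_dict` name are hypothetical, and the value handling is simplified compared to the parse action above (leaf values are always kept as a list):

```python
import pyparsing as pp

# hypothetical mapping: name in the file -> name wanted in the output
FIELD_RENAMES = {"floors": "storeys", "usage": "purpose"}

LT, GT, LBRACK, RBRACK = map(pp.Suppress, "<>[]")
entry = pp.Forward()
value = pp.pyparsing_common.number | pp.Word(pp.alphas, pp.alphanums + "_")
subentry = pp.Group(LBRACK - value[...] + RBRACK)
entry <<= pp.Group(LT
                   - (subentry * 3)("type_name_value")
                   + pp.Group(entry[...])("contents") + GT)

def entry_to_dict(tokens):
    parsed = tokens[0]
    data_type, name, value = parsed.type_name_value
    name = name[0] if name else None
    ret = {
        "type": data_type[0] if data_type else None,
        # rename here; names without a mapping are kept unchanged
        "name": FIELD_RENAMES.get(name, name),
    }
    if parsed.contents:
        ret["value"] = list(parsed.contents)
    else:
        # simplification: keep every leaf value as a list
        ret["value"] = list(value)
    return ret

entry.addParseAction(entry_to_dict)

sample = """
<[building][element][0]
    <[int32][floors][3]>
    <[stringt8c][usage][residential]>
    <[stringt8c][roof][gable]>
>
"""
result = entry.parseString(sample, parseAll=True)[0]
print([child["name"] for child in result["value"]])
```

Doing the rename inside the parse action keeps it in one place, but the same mapping could just as well be applied afterwards while walking the resulting dicts.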

That should give you a good start to process your markup.
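For the "overwrite certain values" part of the question, one possible next step is to walk the resulting dicts and replace values by name, then dump the result as JSON. This is only a sketch under assumptions: `tree` is a hand-written fragment in the same shape as the output above, and `update_values` and `values_to_update` are made-up names, not part of pyparsing:

```python
import json

# hand-written fragment in the same shape as the parsed output
tree = {
    "type": "building", "name": "element",
    "value": [
        {"type": "vector3_float64", "name": "position",
         "value": [7.809637, 46.182262, 0]},
        {"type": "int32", "name": "floors", "value": 3},
        {"type": "stringt8c", "name": "usage", "value": "residential"},
    ],
}

# hypothetical replacements, keyed by entry name
values_to_update = {"usage": "commercial", "floors": 5}

def update_values(node, new_values):
    """Overwrite leaf values whose name is in new_values; return how many changed."""
    value = node["value"]
    # nested entries are dicts; a vector value is a plain list of numbers
    children = [v for v in value if isinstance(v, dict)] if isinstance(value, list) else []
    if children:
        return sum(update_values(child, new_values) for child in children)
    if node["name"] in new_values:
        node["value"] = new_values[node["name"]]
        return 1
    return 0

changed = update_values(tree, values_to_update)
print(changed, "value(s) updated")
print(json.dumps(tree, indent=2))
```

This matches entries by name only, so every entry called "usage" anywhere in the tree would be changed. If only one specific building should be updated, the walk would need an extra condition, for example on the parent entry's value.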

PaulMcG