Regex for IFC with array attributes

Question

IFC is a variation of STEP files used for construction projects. An IFC file contains information about the building being constructed. The file is text-based and easy to read. I am trying to parse this information into a Python dictionary. The general format of each line is similar to the following:

#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);

Ideally this should be parsed into #2334, IFCMATERIALLAYERSETUSAGE, #2333,.AXIS2.,.POSITIVE.,-180. I found a solution for part of the problem in "Regex includes two matches in first match" (https://regex101.com/r/RHIu0r/10). However, in some cases the data contains arrays instead of single values, as in the example below:

#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);

This case needs to be parsed as #2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334, where [#40,#221,#268,#281] is stored in a single variable as an array. The array can appear in the middle or as the last value.

Would you be able to assist in creating a regular expression that produces the desired results? I have created https://regex101.com/r/mqrGka/1 with test cases.

How shall the `#…` values be stored: as strings, or as numbers? — Armali, Feb 28 '20 at 09:48
In your test cases, you have two lines with `#2=…`. Are you aware that when storing _into a python dictionary_ with `#2` as the key, the first line will be lost? — Armali, Feb 28 '20 at 09:54
Do you want the quotes `'` to be preserved within the stored strings? — Armali, Feb 28 '20 at 10:44
Numbers of string are ok. The quote ' should not be preserved. However, the quotes are important: if there are any commas between quotes, they should not be used to split the text — Hassan Emam, Feb 28 '20 at 14:13
I'm not sure what you mean by _Numbers of string are ok_ - is it okay that numbers are stored as strings, or is it okay that numbers are made from strings? — Armali, Feb 28 '20 at 15:00
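To illustrate the duplicate-key point raised in the comments, here is a minimal sketch. The `rows` data is made up for the example, and grouping with `defaultdict` is just one possible way to keep repeated ids; it is not something the question asks for.

```python
from collections import defaultdict

# Two hypothetical parsed rows that share the id "2" (illustrative data only).
rows = [
    ("2", ["IFCSPACE", "first", "#1", "$"]),
    ("2", ["IFCSPACE", "second", "#1", "$"]),
]

# A plain dict keeps only the last row stored under a given key.
plain = {}
for key, values in rows:
    plain[key] = values

# Grouping into lists keeps every row that shares a key.
grouped = defaultdict(list)
for key, values in rows:
    grouped[key].append(values)

print(len(plain), len(grouped["2"]))
```

With the plain dictionary only the row containing "second" survives; the grouped version holds both.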

Armali · Accepted Answer · 2020-02-28T15:26:42.077

Here's a solution that continues from the point you reached with the regular expression in the test cases:

file = """\
#1=IFCOWNERHISTORY(#89024,#44585,$,.NOCHANGE.,$,$,$,1190720890);
#2=IFCSPACE(';;);',#1,$);some text);
#2=IFCSPACE(';;);',#1,$);
#2885=IFCRELAGGREGATES('1gtpBVmrDD_xsEb7NuFKc8',#5,$,$,#2813,(#2840,#2846,#2852,#2858,#2879));
#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
""".splitlines()

import re

# "#<id> = <TYPE>(<attributes>);" where quoted strings may contain ';' and ')'
pattern = re.compile(r"#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);")

d = dict()
for line in file:
    m = pattern.match(line)
    if m is None: continue  # skip lines that are not entity definitions (e.g. header lines)
    attr = m.group(3)       # attribute list string
    values = [m.group(2)]   # first value is the entity type name
    while attr:
        start = 1
        if attr[0] == "'": start += attr.find("'", 1)   # don't split at comma within string
        if attr[0] == "(": start += attr.find(")", 1)   # don't split item within parentheses
        end = attr.find(",", start)                     # search for a comma / end of item
        if end < 0: end = len(attr)
        value = attr[1:end-1].split(",") if attr[0] == "(" else attr[:end]
        if value[0] == "'": value = value[1:-1]         # remove quotes
        values.append(value)
        attr = attr[end+1:]                             # remove current attribute item
    d[m.group(1)] = values                              # store into dictionary
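Since the question asks for a regex, the attribute splitting could also be done with a second pattern and `re.finditer` instead of the manual loop. This is only a sketch: `LINE_RE`, `ITEM_RE` and `parse_line` are names made up here, nested lists such as `((1.,2.),(3.,4.))` are not handled, and doubled apostrophes inside strings (the STEP escape for a quote) are left as they are.

```python
import re

# Header: "#<id>=<TYPE>(<attributes>);" (same shape as the pattern in the answer).
LINE_RE = re.compile(r"#(\d+)\s*=\s*([A-Za-z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);")

# One attribute item: a quoted string, a flat parenthesised list, or a bare token.
ITEM_RE = re.compile(r"'((?:[^']|'')*)'|\(([^()]*)\)|([^,()']+)")

def parse_line(line):
    """Return (id, [type, attr1, attr2, ...]), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    values = [m.group(2)]                      # first value is the entity type name
    for item in ITEM_RE.finditer(m.group(3)):
        string, array, token = item.groups()
        if string is not None:
            values.append(string)              # surrounding quotes are outside the group
        elif array is not None:
            values.append(array.split(",") if array else [])
        else:
            values.append(token)
    return m.group(1), values

print(parse_line("#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);"))
```

Whether this is actually faster than the loop would need to be measured on a real file; compiling both patterns once outside the per-line work is the main point.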

Thank you, that really helped a lot. However, it is a bit slow, hence I was looking for a regex — Hassan Emam, Mar 02 '20 at 06:30
@Hassan Emam - If there were a large example file somewhere, it might be possible to find an optimized solution. — Armali, Mar 02 '20 at 06:54
Often the files are quite big: 100k+ lines for a simple project and a few million lines for more complicated jobs. I found the performance acceptable; however, the search is a bit challenging and takes a long time. I am trying different search algorithms. — Hassan Emam, Mar 03 '20 at 07:42
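On the search side, one option that might help is to build a lookup index once after parsing, so that later searches do not have to scan every entry. The sketch below assumes searches are by entity type name, which the comments do not actually state; `build_type_index` and `sample` are hypothetical names, and `sample` only mimics the shape of the dictionary `d` from the answer.

```python
from collections import defaultdict

def build_type_index(entities):
    """Map each entity type name to the list of ids that have that type."""
    index = defaultdict(list)
    for entity_id, values in entities.items():
        index[values[0]].append(entity_id)     # values[0] is the entity type name
    return index

# Small hand-written sample in the same shape as d in the answer.
sample = {
    "2334": ["IFCMATERIALLAYERSETUSAGE", "#2333", ".AXIS2.", ".POSITIVE.", "-180."],
    "2335": ["IFCRELASSOCIATESMATERIAL", "2ON6$yXXD1GAAH8whbdZmc", "#5", "$", "$",
             ["#40", "#221", "#268", "#281"], "#2334"],
}

by_type = build_type_index(sample)
print(by_type["IFCRELASSOCIATESMATERIAL"])
```

Building the index is a single pass over the dictionary; after that, each lookup by type is a dictionary access rather than a full scan.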
