I am trying to parse many COBOL copybooks using Python.
I have this regular expression, which I modified from the one provided in cobol.py:
^(?P<level>\d{2})\s+(?P<name>\S+).*?
(\s+INDEXED BY\s+(?P<indexed_by>\S+))?.*?
(\s+REDEFINES\s+(?P<redefines>\S+))?.*?
(\s+PIC(TURE)?\s+(?P<pic>\S+))?.*?
(\s+OCCURS\s+(?P<occurs>\d+).?( TIMES)?)?.*?
((?P<comp>)\s+COMP\S+)?.*?
(\s+VALUE\s+(?P<value>\S+).*)?
\.$
Here is a sample of text. The regex works for every line except the second-to-last one. On that line the pic group finds no match, because the OCCURS clause has already (ahem) occurred earlier in the string, before the PICTURE clause.
03 AMOUNT-BREAKDOWN PICTURE 9(8)V99 VALUE ZEROES.
03 AMOUNT-BREAKDOWN-X REDEFINES AMOUNT-BREAKDOWN.
05 FILLER PICTURE X(3) VALUE "DEC".
03 MONTH REDEFINES MONTH-TAB PICTURE X(3) OCCURS 12 TIMES.
03 SUB PICTURE 99 VALUE 0.
03 NUMBER-HOLD.
05 NUMB-HOLD PICTURE X OCCURS 11 TIMES.
05 FILLER PICTURE X(5) VALUE "TEN".
03 DIGIT-TAB2 REDEFINES DIGIT-TAB1.
05 DIGIT-TABLE OCCURS 10 PICTURE X(5).
03 WK-TEN-MILLION PICTURE X(5) VALUE SPACES.
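For anyone who wants to reproduce this, here is a minimal sketch. It assumes the pattern lines above are joined into a single string and compiled without re.VERBOSE, so the literal spaces in "INDEXED BY" and " TIMES" stay significant. The names PATTERN, SAMPLE and lines_missing_pic are just for illustration.

```python
import re

# The pattern from the question, joined into one string.
# Assumption: no re.VERBOSE flag, so literal spaces in the pattern count.
PATTERN = re.compile(
    r"^(?P<level>\d{2})\s+(?P<name>\S+).*?"
    r"(\s+INDEXED BY\s+(?P<indexed_by>\S+))?.*?"
    r"(\s+REDEFINES\s+(?P<redefines>\S+))?.*?"
    r"(\s+PIC(TURE)?\s+(?P<pic>\S+))?.*?"
    r"(\s+OCCURS\s+(?P<occurs>\d+).?( TIMES)?)?.*?"
    r"((?P<comp>)\s+COMP\S+)?.*?"
    r"(\s+VALUE\s+(?P<value>\S+).*)?"
    r"\.$"
)

SAMPLE = """\
03 AMOUNT-BREAKDOWN PICTURE 9(8)V99 VALUE ZEROES.
03 AMOUNT-BREAKDOWN-X REDEFINES AMOUNT-BREAKDOWN.
05 FILLER PICTURE X(3) VALUE "DEC".
03 MONTH REDEFINES MONTH-TAB PICTURE X(3) OCCURS 12 TIMES.
03 SUB PICTURE 99 VALUE 0.
03 NUMBER-HOLD.
05 NUMB-HOLD PICTURE X OCCURS 11 TIMES.
05 FILLER PICTURE X(5) VALUE "TEN".
03 DIGIT-TAB2 REDEFINES DIGIT-TAB1.
05 DIGIT-TABLE OCCURS 10 PICTURE X(5).
03 WK-TEN-MILLION PICTURE X(5) VALUE SPACES.
""".splitlines()


def lines_missing_pic(lines):
    """Return lines that contain PICTURE but have no pic capture."""
    missing = []
    for line in lines:
        m = PATTERN.match(line.strip())
        if m and "PICTURE" in line and m.group("pic") is None:
            missing.append(line)
    return missing


for line in lines_missing_pic(SAMPLE):
    print("pic not captured:", line)
```

On the sample above, this reports only the DIGIT-TABLE line: the whole pattern still matches it, but the pic group comes back as None.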
I struggle with regular expressions, and I think I risk creating a mess because I am missing something fundamental.
To be clear: every row with a PICTURE clause is captured by the pic group except the second-to-last one, where the PICTURE clause comes after the OCCURS clause and so falls after the occurs capture group in the pattern.
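One direction that might sidestep the ordering problem, offered only as an untested-in-production sketch and not taken from cobol.py: match the level and name first, then search for each clause independently so that clause order stops mattering. The names HEAD, CLAUSES and parse_line are hypothetical, and the sketch assumes each entry sits on one line and that clause keywords never appear inside a quoted VALUE literal.

```python
import re

# Level number and data name at the start of the entry.
HEAD = re.compile(r"^(?P<level>\d{2})\s+(?P<name>\S+)")

# One small pattern per clause; each is searched on its own,
# so the clauses may appear in any order.
CLAUSES = {
    "redefines": re.compile(r"\s+REDEFINES\s+(\S+)"),
    "pic": re.compile(r"\s+PIC(?:TURE)?\s+(\S+)"),
    "occurs": re.compile(r"\s+OCCURS\s+(\d+)"),
    "value": re.compile(r"\s+VALUE\s+(\S+)"),
}


def parse_line(line):
    """Parse one copybook entry into a dict, or return None if it has no level/name."""
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the terminating period before searching
    head = HEAD.match(text)
    if head is None:
        return None
    fields = head.groupdict()
    rest = text[head.end():]
    for key, pattern in CLAUSES.items():
        m = pattern.search(rest)
        fields[key] = m.group(1) if m else None
    return fields


print(parse_line("05 DIGIT-TABLE OCCURS 10 PICTURE X(5)."))
```

With this approach the DIGIT-TABLE line yields both pic and occurs, at the cost of giving up the single-regex form and the assumptions listed above.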
Any help appreciated.