Looking for The right way with Regular Expression with groups in different order

Question

I am trying to parse many cobol copybooks using python.

I have this regex expression that I have modified from one provided in cobol.py:

^(?P<level>\d{2})\s+(?P<name>\S+).*?
(\s+INDEXED BY\s+(?P<indexed_by>\S+))?.*?
(\s+REDEFINES\s+(?P<redefines>\S+))?.*?
(\s+PIC(TURE)?\s+(?P<pic>\S+))?.*?
(\s+OCCURS\s+(?P<occurs>\d+).?( TIMES)?)?.*?
((?P<comp>)\s+COMP\S+)?.*?
(\s+VALUE\s+(?P<value>\S+).*)?
\.$

Here is a sample of text that works for all lines except the second-to-last line. On that line the pic group is not matched, because the OCCURS clause has already (ahem) occurred earlier in the string, before PICTURE.

03  AMOUNT-BREAKDOWN        PICTURE 9(8)V99  VALUE ZEROES.
03  AMOUNT-BREAKDOWN-X REDEFINES AMOUNT-BREAKDOWN.
05  FILLER              PICTURE X(3)     VALUE "DEC".
03  MONTH REDEFINES MONTH-TAB  PICTURE X(3) OCCURS 12 TIMES.
03  SUB                 PICTURE 99    VALUE 0.
03  NUMBER-HOLD.
05  NUMB-HOLD       PICTURE X  OCCURS 11 TIMES.
05  FILLER              PICTURE X(5)     VALUE "TEN".
03  DIGIT-TAB2 REDEFINES DIGIT-TAB1.
05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).
03  WK-TEN-MILLION          PICTURE X(5)     VALUE SPACES.

I struggle with regular expressions but I think I risk creating a mess because I am missing something fundamental.

To be clear: all the rows with PICTURE statements are captured by the pic group except the second-to-last line, where PICTURE comes after the OCCURS clause instead of before it.

Any help appreciated.

Yes, you are missing something fundamental. And that is never to parse source code with regular expressions. Use a parser. (There seems to be at least one COBOL Copybook parser for Python - https://github.com/balloob/Python-COBOL) — Tomalak, Oct 15 '17 at 01:56
@Tomalak, this code is borrowed from that module. It was incomplete and didn't cover all cases so I have modified it. But this is exactly how it is done in that module. — Alan, Oct 15 '17 at 01:57
Not really. Regex is unsuitable for this job, and you are seeing why already. Look for other options to convert that input into a tree structure that you can access from Python. Related thread: https://stackoverflow.com/questions/17567699/is-there-a-python-library-to-parse-and-manipulate-cobol-code - one answer there suggests using a tool to convert it to XML and then using the XML in Python, because for that Python does have tools. — Tomalak, Oct 15 '17 at 02:01
If you can write down in abstract terms the grammar that your input files conform to (maybe it's not too much work because you only expect a limited subset of the whole thing) then using a [parser generator](https://wiki.python.org/moin/LanguageParsing) would be an option. — Tomalak, Oct 15 '17 at 02:06
I appreciate your help @Tomalak - I will take a look at these. — Alan, Oct 15 '17 at 02:09
You can weasel through with regex if things are really simplistic, but any regex solution will fall apart every time you get valid inputs which you did not anticipate. To answer your initial question - regex group order is fixed. If your input is in a different order then you need a new regex that matches that order. This gets tedious very fast, and every time you make fixes the regex becomes more and more unmaintainable. — Tomalak, Oct 15 '17 at 02:14
Have a look at `cb2xml` (https://sourceforge.net/projects/cb2xml/). It is a Java program that will convert a Cobol copybook to XML. cb2xml also calculates positions / lengths for Mainframe Cobol copybooks. There is a basic example of processing the XML written in Python — Bruce Martin, Oct 15 '17 at 04:42
If you need to do it in Python; `cb2xml` is written with SableCC. There is a python version of SableCC. You could pick up the scc (Cobol syntax file) from cb2xml and generate a python version — Bruce Martin, Oct 15 '17 at 04:46
Finally the Cobol is invalid, DIGIT-TAB2 redefines DIGIT-TAB1 which does not exist — Bruce Martin, Oct 15 '17 at 04:51
Thanks Bruce, I just took a snapshot to provide some sample lines. There are many lines, and about 1000 files. I am trying the Java XML approach you suggested. New to Java, but I'll have a look. And if I can do it in Python, all the better. Thanks again. — Alan, Oct 15 '17 at 05:38
You can use cb2xml to convert the Cobol to XML and then do it all in Python. You could also use Jython (Python 2.7). By the way, why are you parsing the Cobol? — Bruce Martin, Oct 15 '17 at 10:44
@BruceMartin your cb2xml library worked with the file concerned (to get it to XML) - thank you very much. I will look to the other files I have. I am reading hexdumps from the isam files and bringing them into a RDBMS. — Alan, Oct 16 '17 at 02:36
I have put the details of cb2xml in an answer + added some extra info — Bruce Martin, Oct 16 '17 at 04:02

score 1 · Answer 1 · answered Oct 18 '20 at 20:24

PyParsing (https://github.com/pyparsing/pyparsing) is a good module to easily build grammars. You can build a basic Copybook grammar and parse it using PyParsing. You would have to then post process to retain the tree-like structure that is represented by the two-digit level fields.

Also take a look at the Copybook package (https://github.com/zalmane/copybook) which uses PyParsing.
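The post-processing step mentioned above (recovering the tree from the level numbers) can be sketched without any parser library. This is only an illustration: `build_tree` and the dictionary layout are hypothetical names chosen here, the input is assumed to be already-parsed `(level, name)` pairs, and special levels such as 66, 77 and 88 are not handled.

```python
def build_tree(fields):
    """Nest (level, name) pairs by COBOL level number using a stack."""
    root = {"level": 0, "name": "ROOT", "children": []}
    stack = [root]
    for level, name in fields:
        node = {"level": level, "name": name, "children": []}
        # Pop until the top of the stack has a strictly lower level:
        # that entry is the parent of the current field.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


# (level, name) pairs taken from the sample lines in the question.
fields = [
    (3, "AMOUNT-BREAKDOWN-X"),
    (5, "FILLER"),
    (3, "NUMBER-HOLD"),
    (5, "NUMB-HOLD"),
    (5, "FILLER"),
]
tree = build_tree(fields)
for group in tree["children"]:
    print(group["name"], [child["name"] for child in group["children"]])
```

A stack works here because a field's parent is always the most recent field with a lower level number.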

jq170727 · Answer 2 · 2017-10-15T02:22:01.813

Although an actual parser like PLY or Parsley would be best for this, if you have to use regex, can't you just add another OCCURS group with a different key? For example:

"""
03  AMOUNT-BREAKDOWN        PICTURE 9(8)V99  VALUE ZEROES.
03  AMOUNT-BREAKDOWN-X REDEFINES AMOUNT-BREAKDOWN.
05  FILLER              PICTURE X(3)     VALUE "DEC".
03  MONTH REDEFINES MONTH-TAB  PICTURE X(3) OCCURS 12 TIMES.
03  SUB                 PICTURE 99    VALUE 0.
03  NUMBER-HOLD.
05  NUMB-HOLD       PICTURE X  OCCURS 11 TIMES.
05  FILLER              PICTURE X(5)     VALUE "TEN".
03  DIGIT-TAB2 REDEFINES DIGIT-TAB1.
05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).
03  WK-TEN-MILLION          PICTURE X(5)     VALUE SPACES.
"""
import re

# The sample lines above are this script's module docstring, so __doc__ holds them.
pattern = re.compile(
    r"^(?P<level>\d{2})\s+(?P<name>\S+).*?"
    r"(\s+INDEXED BY\s+(?P<indexed_by>\S+))?.*?"
    r"(\s+REDEFINES\s+(?P<redefines>\S+))?.*?"
    r"(\s+OCCURS\s+(?P<occurs1>\d+).?( TIMES)?)?.*?"   # <-- occurs1 (OCCURS before PICTURE)
    r"(\s+PIC(TURE)?\s+(?P<pic>\S+))?.*?"
    r"(\s+OCCURS\s+(?P<occurs>\d+).?( TIMES)?)?.*?"
    r"((?P<comp>)\s+COMP\S+)?.*?"
    r"(\s+VALUE\s+(?P<value>\S+).*)?"
    r"\.$")

for line in __doc__.split("\n"):
    if not line:
        continue  # skip the empty first and last lines of the docstring
    m = pattern.match(line)
    if m:
        print(m.groups())


Sample output:

('03', 'AMOUNT-BREAKDOWN', None, None, None, None, None, None, None, '        PICTURE 9(8)V99', 'TURE', '9(8)V99', None, None, None, None, None, '  VALUE ZEROES', 'ZEROES')
('03', 'AMOUNT-BREAKDOWN-X', None, None, ' REDEFINES AMOUNT-BREAKDOWN', 'AMOUNT-BREAKDOWN', None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'FILLER', None, None, None, None, None, None, None, '              PICTURE X(3)', 'TURE', 'X(3)', None, None, None, None, None, '     VALUE "DEC"', '"DEC"')
('03', 'MONTH', None, None, ' REDEFINES MONTH-TAB', 'MONTH-TAB', None, None, None, '  PICTURE X(3)', 'TURE', 'X(3)', ' OCCURS 12 ', '12', None, None, None, None, None)
('03', 'SUB', None, None, None, None, None, None, None, '                 PICTURE 99', 'TURE', '99', None, None, None, None, None, '    VALUE 0', '0')
('03', 'NUMBER-HOLD', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'NUMB-HOLD', None, None, None, None, None, None, None, '       PICTURE X', 'TURE', 'X', '  OCCURS 11 ', '11', None, None, None, None, None)
('05', 'FILLER', None, None, None, None, None, None, None, '              PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, '     VALUE "TEN"', '"TEN"')
('03', 'DIGIT-TAB2', None, None, ' REDEFINES DIGIT-TAB1', 'DIGIT-TAB1', None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'DIGIT-TABLE', None, None, None, None, '         OCCURS 10 ', '10', None, '  PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, None, None)
('03', 'WK-TEN-MILLION', None, None, None, None, None, None, None, '          PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, '     VALUE SPACES', 'SPACES')
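If duplicating groups gets unwieldy, another regex-only option is to stop encoding clause order in one pattern and search for each clause independently. The following is a minimal sketch, not a full copybook parser: `HEADER`, `CLAUSES` and `parse_line` are names invented for this example, it assumes each clause appears at most once per line and that keywords such as PICTURE or VALUE never appear inside a quoted literal, and it leaves out INDEXED BY and COMP for brevity.

```python
import re

# Level number and field name always come first, so they keep a fixed pattern.
HEADER = re.compile(r"^\s*(?P<level>\d{2})\s+(?P<name>[^\s.]+)")

# One small pattern per clause; each is searched on its own, so the order
# in which the clauses appear on the line no longer matters.
CLAUSES = {
    "redefines": re.compile(r"\sREDEFINES\s+(?P<v>[^\s.]+)"),
    "pic": re.compile(r"\sPIC(?:TURE)?\s+(?P<v>\S+?)\.?(?=\s|$)"),
    "occurs": re.compile(r"\sOCCURS\s+(?P<v>\d+)"),
    "value": re.compile(r"\sVALUE\s+(?P<v>.+?)\.\s*$"),
}


def parse_line(line):
    """Return a dict of the clauses found on one copybook line, or None."""
    header = HEADER.match(line)
    if header is None:
        return None
    record = header.groupdict()
    rest = line[header.end():]
    for key, pattern in CLAUSES.items():
        m = pattern.search(rest)
        record[key] = m.group("v") if m else None
    return record


print(parse_line("05  DIGIT-TABLE         OCCURS 10   PICTURE X(5)."))
print(parse_line("03  MONTH REDEFINES MONTH-TAB  PICTURE X(3) OCCURS 12 TIMES."))
```

Both lines yield the same keys regardless of whether OCCURS comes before or after PICTURE. The earlier caveat still applies: this only copes with the clauses it was written for.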

I thought of that and thought it would get ugly. A parser sounds like a better way to go. Thanks. — Alan, Oct 15 '17 at 02:38

Bruce Martin · Accepted Answer · 2018-01-20T09:49:37.720

cb2xml

You should look at cb2xml. It will parse a Cobol Copybook and create an XML file. You can then process the XML in Python or any other language. The cb2xml package has basic examples of processing the XML in Python and other languages.

Cobol:

   01 Ams-Vendor.
       03 Brand               Pic x(3).
       03 Location-details.
          05 Location-Number  Pic 9(4).
          05 Location-Type    Pic XX.
          05 Location-Name    Pic X(35).
       03 Address-Details.
          05 actual-address.
             10 Address-1     Pic X(40).
             10 Address-2     Pic X(40).
             10 Address-3     Pic X(35).
          05 Postcode         Pic 9(4).
          05 Empty            pic x(6).
          05 State            Pic XXX.
       03 Location-Active     Pic X.

Output from cb2xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<copybook filename="cbl2xml_Test110.cbl">
    <item display-length="173" level="01" name="Ams-Vendor" position="1" storage-length="173">
        <item display-length="3" level="03" name="Brand" picture="x(3)" position="1" storage-length="3"/>
        <item display-length="41" level="03" name="Location-details" position="4" storage-length="41">
            <item display-length="4" level="05" name="Location-Number" numeric="true" picture="9(4)" position="4" storage-length="4"/>
            <item display-length="2" level="05" name="Location-Type" picture="XX" position="8" storage-length="2"/>
            <item display-length="35" level="05" name="Location-Name" picture="X(35)" position="10" storage-length="35"/>
        </item>
        <item display-length="128" level="03" name="Address-Details" position="45" storage-length="128">
            <item display-length="115" level="05" name="actual-address" position="45" storage-length="115">
                <item display-length="40" level="10" name="Address-1" picture="X(40)" position="45" storage-length="40"/>
                <item display-length="40" level="10" name="Address-2" picture="X(40)" position="85" storage-length="40"/>
                <item display-length="35" level="10" name="Address-3" picture="X(35)" position="125" storage-length="35"/>
            </item>
            <item display-length="4" level="05" name="Postcode" numeric="true" picture="9(4)" position="160" storage-length="4"/>
            <item display-length="6" level="05" name="Empty" picture="x(6)" position="164" storage-length="6"/>
            <item display-length="3" level="05" name="State" picture="XXX" position="170" storage-length="3"/>
        </item>
        <item display-length="1" level="03" name="Location-Active" picture="X" position="173" storage-length="1"/>
    </item>
</copybook>
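As a rough idea of what processing this XML in Python can look like, here is a sketch using only the standard library's `xml.etree.ElementTree`. It is not one of the examples shipped with cb2xml. The XML is a trimmed excerpt of the output above, `elementary_fields` is a hypothetical helper, and the record is made-up plain text (real mainframe data may be EBCDIC or packed, which this does not handle).

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the cb2xml output shown above.
XML = """<copybook filename="cbl2xml_Test110.cbl">
    <item display-length="173" level="01" name="Ams-Vendor" position="1" storage-length="173">
        <item display-length="3" level="03" name="Brand" picture="x(3)" position="1" storage-length="3"/>
        <item display-length="41" level="03" name="Location-details" position="4" storage-length="41">
            <item display-length="4" level="05" name="Location-Number" numeric="true" picture="9(4)" position="4" storage-length="4"/>
        </item>
    </item>
</copybook>"""


def elementary_fields(root):
    """Return (name, start, length) for every item that has a picture."""
    fields = []
    for item in root.iter("item"):
        if "picture" in item.attrib:  # group items carry no picture attribute
            start = int(item.get("position")) - 1  # XML positions are 1-based
            fields.append((item.get("name"), start, int(item.get("storage-length"))))
    return fields


fields = elementary_fields(ET.fromstring(XML))

# Slice a made-up text record using the offsets from the copybook.
record = "ABC0042"
values = {name: record[start:start + length] for name, start, length in fields}
print(values)
```

Because cb2xml has already computed `position` and `storage-length`, the Python side only needs to slice each record.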

An interesting application of cb2xml is described in Dynamically Reading COBOL Redefines with C#

CobolToCsv

The CobolToCsv package will convert a Cobol-Data-File to a Csv file. Limitations:

Redefines / Multi-Record files are not handled
Only a fairly limited range of Cobol compilers is supported (Mainframe, Gnu Cobol, Fujitsu-Cobol).

Cobol2Csv should be able to handle text files (+ Comp-3). It may handle some of your files.
