2

Given some VBA code (stored in text files) that makes calls to rules based on conditions, I'd like to parse the code out and create mini code snips of everything you need to know to generate that rule (to make understanding the essence of the rule easier).

I started attempting in python to regex some of the rules, but don't want to recreate the wheel if it exists. I know of examples like Retrieving JSON objects from a text file (using Python) to override a base class to create a customer parser. I'm not sure if there is any package best suited to start with for this, and haven't had any luck finding one.

Background is that there are ~5000 rules that I want to "shrink" like this, to more simply state the logics around the rules, see things how many rules are impacted by a certain variable, etc.

Input:

Sub One(position As Long)    
    Dim y As Long    
    With TEMP_ARRAY(position)
        'Comments
        If .VAR_A = "A" And .VAR_B = "B" And .VAR_C = "C" Then
            Call Some_Rule("Rule 1")
        End If

        'More Comments
        If IsEmpty(.SUB_ARRAY) Then
            Call Some_Rule("Rule 2")
        Else
            If .VAR_A = 2 Then
                If .VAR_B <> "" Then
                    'Some more comments
                    For y = 0 To UBound(.SUB_ARRAY)
                        If .SUB_ARRAY(y, 2) = 1 Or .SUB_ARRAY(y, 2) = 2 Then Exit For
                    Next y
                    If y = UBound(.SUB_ARRAY, 1) + 1 Then
                        Call Some_Rule("Rule 3")
                    End If
                Else
                    'Still more comments
                    Select Case .SUB_ARRAY(0, 2)
                        Case 3
                            Call Some_Rule("Rule 4")
                        Case 4
                            Call Some_Rule("Rule 5")
                    End Select
                End If
            End If
        End If
    End With
End Sub

Desired Output:

## RULE 1
Sub One(position As Long)
    With TEMP_ARRAY(position)
        'Comments
        If .VAR_A = "A" And .VAR_B = "B" And .VAR_C = "C" Then
            Call Some_Rule("Rule 1")
        End If
    End With
End Sub

## RULE 2
Sub One(position As Long)
    With TEMP_ARRAY(position)
        'More Comments
        If IsEmpty(.SUB_ARRAY) Then
            Call Some_Rule("Rule 2")
        End If
    End With
End Sub

## RULE 3
Sub One(position As Long)
    Dim y As Long
    With TEMP_ARRAY(position)
        'More Comments
        If IsEmpty(.SUB_ARRAY) Then
        Else
            If .VAR_A = 2 Then
                If .VAR_B <> "" Then
                    'Some more comments
                    For y = 0 To UBound(.SUB_ARRAY)
                        If .SUB_ARRAY(y, 2) = 1 Or .SUB_ARRAY(y, 2) = 2 Then Exit For
                    Next y
                    If y = UBound(.SUB_ARRAY, 1) + 1 Then
                        Call Some_Rule("Rule 3")
                    End If
                End If
            End If
        End If
    End With
End Sub

## RULE 4
Sub One(position As Long)
    With TEMP_ARRAY(position)
        'More Comments
        If IsEmpty(.SUB_ARRAY) Then
        Else
            If .VAR_A = 2 Then
                If .VAR_B <> "" Then
                Else
                    'Still more comments
                    Select Case .SUB_ARRAY(0, 2)
                        Case 3
                            Call Some_Rule("Rule 4")
                    End Select
                End If
            End If
        End If
    End With
End Sub

## RULE 5
Sub One(position As Long)
    With TEMP_ARRAY(position)
        'More Comments
        If IsEmpty(.SUB_ARRAY) Then
        Else
            If .VAR_A = 2 Then
                If .VAR_B <> "" Then
                Else
                    'Still more comments
                    Select Case .SUB_ARRAY(0, 2)
                        Case 4
                            Call Some_Rule("Rule 5")
                    End Select
                End If
            End If
        End If
    End With
End Sub

Edit: Here's what I've done so far (more code to it but this is the core of it). Basically, find a line starting with "Some_Rule" (using regex) then call this function starting in upwards direction. When it finds an open tag, it changes directions and starts looking for its closing tag, then picks up where it left off going up again, etc. I successfully get rule 1 this way and then it was 4am so i went to sleep :) ... i'm tossing things around at this point so still really sloppy but wanted to update on my progress

def compile_rule(lines, j, direction, statement_open=False):
    """
    lines           : total lines in file
    j               : current position
    direction       : 1 is up, -1 is down
    statement_open  : vba syntax not yet closed ex: if without end if
    """
    global rule
    j -= direction
    if line_type(lines[j]) in [0, 3] and not statement_open:
        rule.append(lines[j], j, direction)
    elif line_type(lines[j]) == 1 and not statement_open:
        rule.append(lines[j], j, direction)
        rule.start_looking_for(line_check(lines[j]))
        statement_open = True
        direction *= -1
    elif line_type(lines[j]) == 2 and rule.looking_for() == line_check(lines[j]) and statement_open:
        rule.append(lines[j], j, direction)
        statement_open = False
        direction *= -1
    else:
        rule.set_position(j, direction)
    if (j > 0 and j < len(lines) - 1) or (j == 0 and statement_open):
        compile_rule(lines, rule.get_position(direction), direction, statement_open)
Community
  • 1
  • 1
NikT
  • 1,590
  • 2
  • 16
  • 29
  • So you got 1 rule to work by hacking, with 4999 to go? – Ira Baxter Nov 05 '16 at 21:13
  • Why is "Python" a fundamental part of the question? – Ira Baxter Nov 06 '16 at 05:17
  • no reason -- thought there might have been some simpler python parser available. the rules will be (potentially/eventually) moving from VBA to Python so that data analysts with python experience can maintain the rules. thought it might make it easier to move if the rules were made individual -- it wasn't a requirement so didn't intend to spend much time on it. – NikT Nov 07 '16 at 04:26
  • I highly doubt somebody has built a serious VBA parser in Python. So your real problem is to convert the VBA to Python? What you are asking for is a VBA to Python translator, unless you want to do it by hand. – Ira Baxter Nov 07 '16 at 04:37
  • Yeah seems to be the case ... and unfortunately I think that part has to be by hand. It's a collection of rules built over the last 7 years (by a number of different people) and will require a lot of consolidation and cleanup. – NikT Nov 07 '16 at 04:45
  • I assume that these rules are hard-won business knowledge. You don't want to rewrite them by hand (because you'll get to debug them again) if you can get automation to do it *reliably*. If you are serious about this, my company provides tools (or services) to do exactly these kind of things. See bio for contact information. – Ira Baxter Nov 07 '16 at 05:04

1 Answers1

1

If your VBA code is written the way most Excel code seems to be written, e.g., sloppily, you won't be able to rely on the existing VBA code to have any nice, neat, easily decodable structure. If it isn't very consistent, you'll have no chance of using ad hoc/regex based techniques on it.

In that case, you'll need a full VBA parser, not some bad approximation of one. You can't write a decent parser with Regexes (well known fact). If you use a more reasonable parsing technology, you'll have a hard time defining the grammar since it is very poorly documented.

The lesson here is you probably don't want to try to write your own parser. You are better off finding one that works, and using that. I doubt you will find easily find one, let alone one that happens to be in Python; but I could be wrong.

My company happens to have such a parser for VB6, (which is essentially VBA) built on top of our DMS Software Reengineering Toolkit. This parser has been tested on millions of lines of code. It automatically builds abstract syntax trees representing the program.

To do what you want to do, you must:

  • Parse the source code
  • Find rule invocations
  • Discover the control flow path from the function start to the rule invocation
  • Compose the conditionals along that path

Given that VBA has goto statements (your example code does not, but with 5000 instances that seems like a pretty bad bet), you will need to construct the control flow graph of the code so that you can navigate along it to find the path from function entry to rule invocation. DMS provides support for finding patterns in code, for constructing control and data flow graphs, for walking along flow graphs, and for composing valid program chunks out of subtrees of the original code. You are pretty likely to need all of this.

[Based on your examples, which all appear to be valid VBA subroutines and not just some conjunction of conditionals, it appears you want to compute backward slices from each rule invocation. That requires all the machinery I've described above.]

This is not a task for the faint of heart, nor is it likely to be a weekend exercise. But it is likely to be viable.

Given that your goal is vaguely "to make them more readable", you might not be able to justify the engineering effort to do this right.

If you insist on an ad hoc approach, you probably won't finish, and you may find out that your "more readable" version of the rules, if you get them, aren't the actual truth and your readers will revolt. Nobody wants to waste time reading things that might be wrong.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341