2

I am trying to write a VBA Regex to remove comments from Teradata SQL text files.

There are two types of comments:

1 -- The occurrence of two dashes, '--' denotes the remainder of the line as comments.

2./* xxx */ Everything between '/*' and '*/' is comments. Comments of this type can be spread over 1+ lines.

The complication is with text in single quotes such as '--these dashes are in single quotes so do not denote forthcoming comments'.

I am new to Regex and trying to figure this out is proving to be beyond me. I've been thinking along the lines of negative lookahead.

Can anyone help please?

This is as far as I have got:

Type 1:

\-\-[\S \t]*$

Type 2:

/\*[\s\S]*?\*/

Hemdip
  • 410
  • 6
  • 29
John
  • 33
  • 1
  • 3
  • 1
    Can you add examples with input and expected output? – Florent B. Mar 21 '16 at 12:40
  • In general you can't do this with regular expressions. You need to use a SQL parser of some sort. Similar question here: http://stackoverflow.com/questions/9842991/regex-to-remove-single-line-sql-comments. – Ru Hasha Mar 21 '16 at 14:26
  • I'd rather use VBA usual string processing functions for this task... – dnep Mar 21 '16 at 16:48

1 Answers1

1

An algorithm based on recursive parser calls. There are several modes: comments of 3 subtypes parsing, quoted parsing and normal. Normal mode can be alternated by any other mode, that in turn become the only normal. Thus e. g. quote chars within comments and any comment chars within quoted text are ignored. Chars to be searched depend on the current mode. The source is parsed chunk by chunk, once target chars are found the mode is switched respectively, current chunk is finished and the next one begins with the next recursive call. Call stack stores transient results. After the source ends, backward process starts, and each called parser concatenates and returns it's chunk, so finally a complete code retrieved.

Here is the code:

Option Explicit

Sub RemoveComments()

    Dim strOriginal As String
    Dim strProcessed As String

    strOriginal = ReadTextFile("C:\Users\DELL\Desktop\tmp\source.sql", 0) ' -2 - System default, -1 - Unicode, 0 - ASCII
    Parse strOriginal, strProcessed, 0
    WriteTextFile strProcessed, "C:\Users\DELL\Desktop\tmp\result.sql", 0

End Sub

Sub Parse(strSrc As String, strRes As String, lngMode As Long)

    Static objRegExp As Object
    Dim strBeg As String
    Dim objMatches As Object
    Dim lngPos As Long
    Dim lngEscPos As Long
    Dim strRet As String

    If objRegExp Is Nothing Then ' initialize regexp once
        Set objRegExp = CreateObject("VBScript.RegExp")
        With objRegExp
            .Global = False
            .MultiLine = True
            .IgnoreCase = True
        End With
    End If
    strRes = ""
    If strSrc = "" Then Exit Sub ' source completed
    strBeg = "" ' preceding chunk is empty by default
    Select Case lngMode
        Case 0 ' processing normal
            With objRegExp
                .Pattern = "(\/\*)|(^[ \t]*--)|(--)|(\')"
                Set objMatches = .Execute(strSrc)
                If objMatches.Count = 0 Then
                    strRes = strSrc
                    Exit Sub ' source completed
                End If
                lngPos = objMatches(0).FirstIndex
                With objMatches(0)
                    Select Case True
                        Case .SubMatches(0) <> ""
                            lngMode = 1 ' start multiline comment
                        Case .SubMatches(1) <> ""
                            lngMode = 2 ' start whole line comment
                        Case .SubMatches(2) <> ""
                            lngMode = 3 ' start singleline comment
                        Case .SubMatches(3) <> ""
                            lngMode = 4 ' start text in quotes
                            lngPos = lngPos + 1 ' skip found quote char
                    End Select
                End With
            End With
            strBeg = Left(strSrc, lngPos)
            lngPos = lngPos + 1
        Case 1 ' processing multiline comment
            lngMode = 0 ' start normal
            lngPos = InStr(strSrc, "*/")
            If lngPos = 0 Then Exit Sub ' source completed, comment unclosed
            lngPos = lngPos + 2 ' skip comment closing char
        Case 2 ' processing whole line comment
            lngMode = 0 ' start normal
            lngPos = InStr(strSrc, vbCrLf)
            If lngPos = 0 Then Exit Sub ' source completed
            lngPos = lngPos + 2 ' skip new line char
        Case 3 ' processing singleline comment
            lngMode = 0 ' start normal
            lngPos = InStr(strSrc, vbCrLf)
            If lngPos = 0 Then Exit Sub ' source completed
        Case 4 ' processing text within quotes
            lngPos = InStr(strSrc, "'")
            If lngPos = 0 Then Exit Sub ' source completed
            If Mid(strSrc, lngPos, 2) = "''" Then ' escaped quote char ''
                strBeg = Left(strSrc, lngPos + 1) ' store preceding chunk with escaped quote char
                lngPos = lngPos + 2 ' shift next from escaped quote char
            Else
                lngMode = 0 ' start normal
                strBeg = Left(strSrc, lngPos) ' store preceding chunk with quote char
                lngPos = lngPos + 1 ' shift next from quote char
            End If
    End Select
    Parse Mid(strSrc, lngPos), strRet, lngMode ' recursive parser call
    strRes = strBeg & strRet ' concatenate preceding chunk with processed and return result

End Sub
omegastripes
  • 12,351
  • 4
  • 45
  • 96
  • Thanks for this. Works well other than where the comments are not the first characters on the line. Is there a simple change to accommodate that situation? – John Mar 22 '16 at 11:36
  • Thanks again for your help @omegastripes. Works perfectly in most cases I have tried now. There is an exception which is... /*this is a comment ---annoyingly, so is this 'also, this */ – John Mar 22 '16 at 14:27
  • @John, I've reworked algorithm according to your requirements. – omegastripes Apr 05 '16 at 20:02