0

I have a regex pattern that works perfectly in Python and various other languages, but is failing to capture the sub matches I need for my implementation in a VBScript regex (the engine of which is apparently almost identical to JavaScript). The pattern in question is as follows:

"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

An example test case is as follows:

email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Mr. Robert Thomas
1104 Madison Avenue
New York, NY 10021


email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Ms. Angela Carraway
402 Arlington Drive
Concord, MA 01742

The objective is for a global regex that extracts 5 subgroups out of this example match after a variable keyword which here is "Sincerely,". The subgroups should be Ms. (1st subgroup), Angela (second subgroup), Carraway (third subgroup), 402 Arlington Drive (fourth subgroup), Concord, MA 01742 (fifth subgroup). In Python, it matches the 5 groups perfectly in a Regex tester, yet for VBScript (the JavaScript engine) it matches the entire string as a match, but with no subgroups at all. Therefore when I call the sub matches in an Excel VBA macro to write to a cell, I get all of the text jumbled up into a couple cells. What am I doing wrong? Is there some character that I am missing that is disabling capturing subgroups? If so what is the critical difference between these two engines so that I can avoid this in the future and how could one fix this pattern in this test case? I've tried reading about the differences online, yet everything said seems to be only small differences that should cause the issue I am having. Any help would be greatly appreciated because I cannot seem to isolate the difference/problem. Thank you!

Edit: The following is the VBA code that utilizes the regex:

Sub regex()
    Dim docxinput As String
    Dim keyword As Variant
    Dim patterninput As Variant
    Dim pattern As String
    Dim regex As New RegExp

    docxinput = Application.GetOpenFilename(Title:="Step #1: Enter Word Document Input File Name")
        Dim wrdApp As Word.Application
        Dim wrdDoc As Word.Document
        Dim strInput As String

        Set wrdApp = CreateObject("Word.Application")
        wrdApp.Visible = False

        Set wrdDoc = wrdApp.Documents.Open(docxinput)
        strInput = wrdDoc.Range.Text

        Debug.Print (strInput)
        wrdDoc.Close 0
        Set wrdDoc = Nothing
        wrdApp.Quit
        Set wrdApp = Nothing

    pattern = "Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

    Dim objMatches As MatchCollection

    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With

    Set objMatches = regex.Execute(strInput)
    Dim row As Variant

    Dim SubMatches As Variant
    row = 2
    For Each SubMatches In objMatches
        Cells(row, 1).Value = objMatches(0).SubMatches(0)
        Cells(row, 2).Value = objMatches(0).SubMatches(1)
        Cells(row, 3).Value = objMatches(0).SubMatches(2)
        Cells(row, 4).Value = objMatches(0).SubMatches(3)
        Cells(row, 5).Value = objMatches(0).SubMatches(4)
        row = row + 1
    Next
End Sub

This is a picture of the results. As you can see, The first two subgroups work but then the regex (or at least I think) runs into grouping error and dumps almost of the other content into the next column. It then moves onto the fourth column, running into errors there as well. Is this an issue with the code iterating or the regex itself. I have tried to troubleshoot the code and cannot find reasons why it cannot break the text up correctly other than the regex being at fault. Any thoughts?

The picture: Screenshot of VBA Regex Issue

J. Squillaro
  • 155
  • 2
  • 13
  • Could you post the VBA code that you're using the execute the regex and retrieve the submatches? – Rich Holton May 07 '17 at 16:01
  • @RichHolton I have made edits to the original post that shows the code and further commentary along with a picture of the results I get when running the VBA. Any thoughts? – J. Squillaro May 07 '17 at 16:55
  • I suspect that your text has something other than \n separating some of the lines, so the third group (.*) captures too much, which then throws the rest off. Can you check to see? – Rich Holton May 07 '17 at 18:27
  • @RichHolton I think you re right as with the modified code form Plirkee below, it works great, yet this Word doc part causes to err. I think possibly the strInput = wrdDoc.Range.Text does something that adds or subtracts characters causing the regex to fail. Is there a more elegant way to perform a regex on the text without running the associated risks of mashing it into on string? – J. Squillaro May 07 '17 at 20:41
  • I *think* that Word uses \r for paragraph marks, and \n for line breaks. So try replacing [\s\n] with [\s\r\n]. – Rich Holton May 07 '17 at 21:01

1 Answers1

1

Your regex should run with VBA with no problems... (tested it here)

To get the desired group in vba take a look here how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops.

Edit: For the following input:

email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Mr. Robert Thomas
1104 Madison Avenue
New York, NY 10021


email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Ms. Angela Carraway
402 Arlington Drive
Concord, MA 01742

which was put inside the cell A1

and the vba code:

(note that I had to alter your for each loop - so that this would work for multiple matches)

Sub myregex()
    Dim keyword As Variant
    Dim patterninput As Variant
    Dim pattern As String
    Dim regex As New RegExp

    Set Myrange = ActiveSheet.Range("A1:A1")
   For Each C In Myrange
   strInput = C.Value
   strPattern = "Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

     With regex
                .Global = True
                .MultiLine = True
                .IgnoreCase = False
                .pattern = strPattern
            End With
            If regex.Test(strInput) Then
                 Set objMatches = regex.Execute(strInput)
                 row = 2
                 For Each SubMatches In objMatches
                 Cells(row, 1).Value = objMatches(row - 2).SubMatches(0)
                 Cells(row, 2).Value = objMatches(row - 2).SubMatches(1)
                 Cells(row, 3).Value = objMatches(row - 2).SubMatches(2)
                 Cells(row, 4).Value = objMatches(row - 2).SubMatches(3)
                 Cells(row, 5).Value = objMatches(row - 2).SubMatches(4)
                 row = row + 1
                Next
            Else
                C.Offset(0, 1) = "(Not matched)"
            End If

    Next
End Sub

I got the following result:

     A      B       C           D                    E 
  2  Mr.    Robert  Thomas      1104 Madison Avenue  New York, NY 10021
  3  Ms.    Angela  Carraway    402 Arlington Drive  Concord, MA 01742

Conclusion: Everything works as expected.

Community
  • 1
  • 1
PKey
  • 3,715
  • 1
  • 14
  • 39
  • I tried that tester too and it yields everything as one match with no subgroups. Is that normal? I need the 5 captured subgroups within that group if that makes sense. I have also edited my original post to show more detail of the issue I having with what I presume is the regex pattern. – J. Squillaro May 07 '17 at 16:57
  • @J.Squillaro So, in that tester (my first link) if you choose `split lists` tab you will see your 5 groups there. For how to get each group in vba - look at my second link. – PKey May 07 '17 at 17:06
  • @J.Squillaro your `vba` script worked for me fine in my excel (of course after some adaptions - I didn't use word document but an excel cell with one string `Dear Sir, Hello Sincerely, Mr. Robert Thomas 1104 Madison Avenue New York, NY 10021` - however regex part worked - excel 2010, vb script regular expressions 5.5 ) – PKey May 07 '17 at 17:54
  • @J.Squillaro take a look at my test results (edited answer) – PKey May 07 '17 at 20:15
  • This works perfect under the conditions if put it into Cell A1! I think that through all of this, the Word doc part of the code is the culprit as if I do your code with the my Word doc input, I get my same issue. So perhaps, as Rich Holton was thinking above, strInput = wrdDoc.Range.Text concatenates the text poorly causing the regex to fail. Is there any way, in your experience, one can just open the file and perform a regex on it as in Python without having to combine lines together and put it into a string? I think that's the final issue here. – J. Squillaro May 07 '17 at 20:39
  • @J.Squillaro I think that now we have narrowed down the problem space. However, this, in my opinion, is a separate issue - and has little to do with the original question (the way it was presented). The original question has been answered.... – PKey May 08 '17 at 05:16