I have a regex pattern that works perfectly in Python and various other languages, but it fails to capture the submatches I need when I run it with the VBScript regex engine (which is apparently almost identical to JavaScript's). The pattern in question is as follows:
"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"
An example test case is as follows:
email received 3/30/17:
Dear Sir,
Hello
Sincerely,
Mr. Robert Thomas
1104 Madison Avenue
New York, NY 10021
email received 3/30/17:
Dear Sir,
Hello
Sincerely,
Ms. Angela Carraway
402 Arlington Drive
Concord, MA 01742
The objective is a global regex that extracts 5 subgroups from each match after a variable keyword, which here is "Sincerely,". For the second signature in this example, the subgroups should be Ms. (first subgroup), Angela (second subgroup), Carraway (third subgroup), 402 Arlington Drive (fourth subgroup), and Concord, MA 01742 (fifth subgroup). In a Python regex tester, the pattern matches the 5 groups perfectly, yet in VBScript (the JavaScript-like engine) it matches the entire string as a single match, with no subgroups at all. As a result, when I read the submatches in an Excel VBA macro and write them to cells, all of the text ends up jumbled into a couple of cells.
What am I doing wrong? Is there some character I am missing that disables capturing subgroups? If so, what is the critical difference between these two engines, so that I can avoid this in the future, and how could one fix this pattern for this test case? I've tried reading about the differences online, yet everything I found describes only small differences that should not cause the issue I am having. Any help would be greatly appreciated, because I cannot seem to isolate the difference or the problem. Thank you!
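For reference, the Python behavior described above can be reproduced with a minimal sketch like the one below. It assumes the test case uses plain `\n` line endings, as a typical online regex tester would; the post does not state which line endings were used, and the names `PATTERN`, `SAMPLE`, and `matches` are illustrative only.

```python
import re

# Pattern copied verbatim from the question.
PATTERN = r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

# Test case from the question, joined with "\n" (an assumption about the
# line endings the regex tester used).
SAMPLE = "\n".join([
    "email received 3/30/17:",
    "Dear Sir,",
    "Hello",
    "Sincerely,",
    "Mr. Robert Thomas",
    "1104 Madison Avenue",
    "New York, NY 10021",
    "email received 3/30/17:",
    "Dear Sir,",
    "Hello",
    "Sincerely,",
    "Ms. Angela Carraway",
    "402 Arlington Drive",
    "Concord, MA 01742",
])

# findall returns one tuple of the five capture groups per match.
matches = re.findall(PATTERN, SAMPLE)
for groups in matches:
    print(groups)
# ('Mr.', 'Robert', 'Thomas', '1104 Madison Avenue', 'New York, NY 10021')
# ('Ms.', 'Angela', 'Carraway', '402 Arlington Drive', 'Concord, MA 01742')
```

With `\n` separators each `.+` stops at the end of its line, which is why the five groups line up with the name, street, and city lines.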
Edit: The following is the VBA code that utilizes the regex:
' Requires references (Tools > References) to "Microsoft VBScript Regular Expressions 5.5"
' and the Microsoft Word Object Library for the early-bound types used below.
Sub regex()
    Dim docxinput As String
    Dim keyword As Variant
    Dim patterninput As Variant
    Dim pattern As String
    Dim regex As New RegExp
    docxinput = Application.GetOpenFilename(Title:="Step #1: Enter Word Document Input File Name")

    ' Read the full text of the Word document into a string
    Dim wrdApp As Word.Application
    Dim wrdDoc As Word.Document
    Dim strInput As String
    Set wrdApp = CreateObject("Word.Application")
    wrdApp.Visible = False
    Set wrdDoc = wrdApp.Documents.Open(docxinput)
    strInput = wrdDoc.Range.Text
    Debug.Print strInput
    wrdDoc.Close 0
    Set wrdDoc = Nothing
    wrdApp.Quit
    Set wrdApp = Nothing

    pattern = "Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"
    Dim objMatches As MatchCollection
    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With
    Set objMatches = regex.Execute(strInput)

    ' Write the five submatches of each match to its own row.
    ' Use the loop variable here; indexing objMatches(0) would repeat the first match on every row.
    Dim objMatch As Match
    Dim row As Long
    row = 2
    For Each objMatch In objMatches
        Cells(row, 1).Value = objMatch.SubMatches(0)
        Cells(row, 2).Value = objMatch.SubMatches(1)
        Cells(row, 3).Value = objMatch.SubMatches(2)
        Cells(row, 4).Value = objMatch.SubMatches(3)
        Cells(row, 5).Value = objMatch.SubMatches(4)
        row = row + 1
    Next
End Sub
This is a picture of the results. As you can see, the first two subgroups work, but then the regex (or at least I think so) runs into a grouping error and dumps almost all of the other content into the next column. It then moves on to the fourth column, running into errors there as well. Is this an issue with the way the code iterates, or with the regex itself? I have tried to troubleshoot the code and cannot find any reason why it fails to break the text up correctly, other than the regex being at fault. Any thoughts?
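One way to narrow this down is to check which line endings the input really contains. The sketch below is in Python rather than VBA, and it rests on an assumption the post does not confirm: that the text read from the Word document separates paragraphs with a bare carriage return (`\r`) instead of `\n`. In Python, `.` stops only at `\n` by default, so with `\r` separators `(.+)` can run across several lines; whether the VBScript engine treats `\r` the same way is something to verify rather than take for granted. The names `ORIGINAL`, `PER_LINE`, and `cr_text`, and the `[^\r\n]+` variant, are illustrative, not a confirmed fix.

```python
import re

# Pattern copied verbatim from the question.
ORIGINAL = r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

# Hypothetical variant: [^\r\n]+ replaces .+ so no group can cross a line
# break, whichever line-ending character the input uses.
PER_LINE = r"Sincerely,\s+([\w.]+)\s+(\w+)\s+([^\r\n]+)\s+(\d+\s[^\r\n]+)\s+([^\r\n]+)"

LINES = [
    "email received 3/30/17:", "Dear Sir,", "Hello", "Sincerely,",
    "Mr. Robert Thomas", "1104 Madison Avenue", "New York, NY 10021",
    "email received 3/30/17:", "Dear Sir,", "Hello", "Sincerely,",
    "Ms. Angela Carraway", "402 Arlington Drive", "Concord, MA 01742",
]

# Assumption: paragraphs are separated by "\r" only.
cr_text = "\r".join(LINES)

original_matches = re.findall(ORIGINAL, cr_text)
per_line_matches = re.findall(PER_LINE, cr_text)

# With "\r" separators the original pattern yields a single match whose
# third group swallows several lines.
print(len(original_matches))            # 1
print("\r" in original_matches[0][2])   # True

# The per-line variant yields one tuple per signature block.
for groups in per_line_matches:
    print(groups)
# ('Mr.', 'Robert', 'Thomas', '1104 Madison Avenue', 'New York, NY 10021')
# ('Ms.', 'Angela', 'Carraway', '402 Arlington Drive', 'Concord, MA 01742')
```

If the same symptom shows up in VBA, inspecting the input for `\r` characters (for example by printing the character code that follows "Sincerely,") would confirm or rule out this explanation before the pattern is changed.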