5

I'm using the Microsoft regular expression engine in Excel VBA. I'm very new to regex but I have a pattern working right now. I need to expand it and I'm having trouble. Here is my code so far:

Sub ImportFromDTD()

Dim sDTDFile As Variant
Dim ffile As Long
Dim sLines() As String
Dim i As Long
Dim Reg1 As RegExp
Dim M1 As MatchCollection
Dim M As Match
Dim myRange As Range

Set Reg1 = New RegExp

ffile = FreeFile

sDTDFile = Application.GetOpenFilename("DTD Files,*.XML", , _
"Browse for file to be imported")

If sDTDFile = False Then Exit Sub '(user cancelled import file browser)


Open sDTDFile For Input Access Read As #ffile
  Lines = Split(Input$(LOF(ffile), #ffile), vbNewLine)
Close #ffile

Cells(1, 2) = "From DTD"
J = 2

For i = 0 To UBound(Lines)

  'Debug.Print "Line"; i; "="; Lines(i)

  With Reg1
      '.Pattern = "(\<\!ELEMENT\s)(\w*)(\s*\(\#\w*\)\s*\>)"
      .Pattern = "(\<\!ELEMENT\s)(\w*)(\s*\(\#\w*\)\s*\>)"

      .Global = True
      .MultiLine = True
      .IgnoreCase = False
  End With

  If Reg1.Test(Lines(i)) Then
    Set M1 = Reg1.Execute(Lines(i))
    For Each M In M1
      sExtract = M.SubMatches(1)
      sExtract = Replace(sExtract, Chr(13), "")
      Cells(J, 2) = sExtract
      J = J + 1
      'Debug.Print sExtract
    Next M
  End If
Next i

Set Reg1 = Nothing

End Sub

Currently, I'm matching on a set of data like this:

 <!ELEMENT DealNumber  (#PCDATA) >

and extract Dealnumber but now, I need to add another match on data like this:

<!ELEMENT DealParties  (DealParty+) >

and extract just Dealparty without the Parens and the +

I've been using this as a reference and it's awesome but I'm still a bit confused. How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops

EDIT

I have come across a few new scenarios that have to be matched on.

 Extract Deal
 <!ELEMENT Deal  (DealNumber,DealType,DealParties) >

 Extract DealParty the ?,CR are throwing me off
 <!ELEMENT DealParty  (PartyType,CustomerID,CustomerName,CentralCustomerID?,
           LiabilityPercent,AgentInd,FacilityNo?,PartyReferenceNo?,
           PartyAddlReferenceNo?,PartyEffectiveDate?,FeeRate?,ChargeType?) >

 Extract Deals
 <!ELEMENT Deals  (Deal*) >
Community
  • 1
  • 1
Matt Williamson
  • 6,947
  • 1
  • 23
  • 36

2 Answers2

3

Looking at your pattern, you have too many capture groups. You only want to capture the PCDATA and DealParty. Try changing you pattern to this:

  With Reg1
      .Pattern = "\<!ELEMENT\s+\w+\s+\(\W*(\w+)\W*\)"

      .Global = True
      .MultiLine = True
      .IgnoreCase = False
  End With

Here's the stub: Regex101.

Code Different
  • 90,614
  • 16
  • 144
  • 163
1

You could use this Regex pattern;

  .Pattern = "\<\!ELEMENT\s+(\w+)\s+\((#\w+|(\w+)\+)\)\s+\>"
  1. This portion

(#\w+|(\w+)\+)

says match either

#a-z0-9
a-z0-9+

inside the parentheses.

ie match either

(#PCDATA)
(DealParty+)

to validate the entire string

  1. Then the submatches are used to extract DealNumber for the first valid match, DealParty for the other valid match

edited code below - note submatch is now M.submatches(0)

    Sub ImportFromDTD()

Dim sDTDFile As Variant
Dim ffile As Long
Dim sLines() As String
Dim i As Long
Dim Reg1 As RegExp
Dim M1 As MatchCollection
Dim M As Match
Dim myRange As Range

Set Reg1 = New RegExp
J = 1

strIn = "<!ELEMENT Deal12Number  (#PCDATA) > <!ELEMENT DealParties  (DealParty+) >"

With Reg1
      .Pattern = "\<\!ELEMENT\s+(\w+)\s+\((#\w+|(\w+)\+)\)\s+\>"
      .Global = True
      .MultiLine = True
      .IgnoreCase = False
End With

If Reg1.Test(strIn) Then
    Set M1 = Reg1.Execute(strIn)
    For Each M In M1
      sExtract = M.SubMatches(2)
      If Len(sExtract) = 0 Then sExtract = M.SubMatches(0)
      sExtract = Replace(sExtract, Chr(13), "")
      Cells(J, 2) = sExtract
      J = J + 1
    Next M
End If

Set Reg1 = Nothing

End Sub
brettdj
  • 54,857
  • 16
  • 114
  • 177
  • Thank for your post, it was the exact answer for my question. Since then, I've come across a few more matches that I need. One of them being a multiline one and I'm having trouble getting the pattern to match. I've been on the regex101 site all day working on it. I've edited my original post to include them. I'm thinking I might not be able to do them all in one pattern. – Matt Williamson Oct 16 '15 at 18:42
  • I'm just going to ask a new question for the additional matches. Thanks for your help! – Matt Williamson Oct 19 '15 at 12:04
  • Hi Brett. Can you take a look at my new post for me? It got buried very fast and was downgraded for what I feel were very ludicrous reasons. [Second try](http://stackoverflow.com/questions/33215467) – Matt Williamson Oct 19 '15 at 19:34