1

Edit 1 - solution

Using the pattern posted below as a start, I edited the expression slightly to ensure it captures phrases as well as words, then we just discard any match that starts with a .

Sub test()
    Application.EnableEvents = False
    getAllTitleCasePhrases Range("D4", Range("D4").End(xlDown))
    Application.EnableEvents = True
End Sub

Private Sub getAllTitleCasePhrases(rng As Range)
    If objRegex Is Nothing Then Set objRegex = New RegExp
        With objRegex
            .Global = True
            .Pattern = "(\.\s*[A-Z][\w,']*)|([A-Z][\w,']*\s?)+"
            End With

    Dim cll As Range, testResult As Object, resultsStr As String
    For Each cll In rng
        Set testResult = objRegex.Execute(cll.Value2)
        If testResult.Count > 0 Then
            Dim i As Long, j As Long
            For i = 0 To testResult.Count - 1
                If Left$(testResult(i), 1) <> "." Then
                    resultsStr = resultsStr & WorksheetFunction.Trim(testResult(i).Value) & ", "
                End If
            Next i
        End If
        If Len(resultsStr) > 0 Then resultsStr = Left(resultsStr, Len(resultsStr) - 2)
        cll.Offset(0, 1).Value2 = resultsStr
        resultsStr = vbNullString
    Next cll
    Set objRegex = Nothing
End Sub

Problem

I would like to automatically create a list of all defined terms in a contract, so that I can then easily check that they do actually have a corresponding definition somewhere. Defined terms in a contract are written in either Title Case or ALL CAPS, I have included some test cases and expected output below.

My approach is to use a regex pattern to scan through the document and pick out all words or phrases that fit these patterns. This part has been straightforward, by using answers here and here. The issues I have are that (1) this still catches words that are capitalised after a full stop and (2) the space at the end of the last word is included in the match.

Please note my implementation below is in Excel VBA, but it is the regex expression I am really interested in. If both (1) and (2) cannot be achieved together, (1) is more important, as I can strip trailing spaces off elsewhere in the code.

Current solution

(^[^\.])?([A-Z]+[a-z,']*\s?)+

Explanation:

(^[^\.])? - if there is a full stop at the start of the string, ignore it

[A-Z]+ - match words which start with at least one capital

[a-z,']* - words can finish with any number of non-capitals and apostrophes are allowed

([A-Z]+[a-z,']*\s?)+ - repeat this pattern to catch multiple words

Test cases

  • To Compare between -> To Compare
  • to Compare -> Compare
  • to Compare Between -> Compare Between
  • to COMPARE BETWEEN -> COMPARE BETWEEN
  • to COMPARE BETWEEN Two Options -> COMPARE BETWEEN Two Options
  • the Purchaser shall -> Purchaser
  • The Purchaser shall Pay On Time according to the Schedule agreed between the Parties -> The Purchaser, Pay On Time, Schedule, Parties
  • . The Purchaser shall Pay On Time according to the Schedule agreed between the Parties -> Purchaser, Pay On Time, Schedule, Parties
  • The Purchaser -> The Purchaser
  • . The Purchaser -> Purchaser
  • . To COMPARE -> COMPARE
  • the Purchaser's Representative -> Purchaser's Representative
  • the Purchasers' Representative -> Purchasers' Representative
  • The Purchaser's Representative -> The Purchaser's Representative
  • . The Purchaser's Representative -> Purchaser's Representative
  • ACME International group -> ACME International
  • ACME International Group -> ACME International Group
  • ABC -> ABC

Implementation

(note list of test cases starts at cell D4)

Option Explicit
Private objRegex As RegExp

Sub test()
    getAllTitleCasePhrases Range("D4", Range("D4").End(xlDown))
End Sub

Private Sub getAllTitleCasePhrases(rng As Range)
    If objRegex Is Nothing Then Set objRegex = New RegExp
        With objRegex
            .Global = True
            .Pattern = "(^[^\.])?([A-Z]+[a-z,']*\s?)+"
            End With

    Dim cll As Range, testResult As Object, resultsStr As String
    For Each cll In rng
        Set testResult = objRegex.Execute(cll.Value2)
        If testResult.Count > 0 Then
            Dim i As Long
            For i = 0 To testResult.Count - 1
                resultsStr = resultsStr & testResult(i).Value & ", "
            Next i
        End If
        If Len(resultsStr) > 0 Then resultsStr = Left(resultsStr, Len(resultsStr) - 2)
        cll.Offset(0, 1).Value2 = resultsStr
        resultsStr = vbNullString
    Next cll
    Set objRegex = Nothing
End Sub
Community
  • 1
  • 1
Jonathan
  • 1,015
  • 1
  • 9
  • 25
  • 1
    Do this. Check for words following a full stop using `\.\s*[A-Z][\w,']*` and then look for desired words using `([A-Z][\w,'])` then you only need to work with capturing group 1. There is no trailing spaces, no words following full stops. Regex would be `\.\s*[A-Z][\w,']*|([A-Z][\w,']*)` – revo May 10 '18 at 08:44
  • Look at green words in live demo https://regex101.com/r/GAiQYX/1 – revo May 10 '18 at 08:48
  • 1
    When using revo's solution, use `testResult(i).Submatches(0)` to get the contents of Capturing group 1. You can't just have a plain regex fix without amending the code since what you need is to match something in one context to discard, and match and capture that in different contexts to keep. – Wiktor Stribiżew May 10 '18 at 09:27
  • Thanks both - I've got it now, please see my question. I'm happy to upvote an answer if you post one. – Jonathan May 10 '18 at 09:45

0 Answers0