Given a string like the following:

"First there is a sentence or two, then a citation which I'd like to extract. Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18."

Using this regular expression: "\b[^\.\;]+(,\s+p+\.\s+(\d+\-\d+|\d+))"

I'm able to match this portion of the string:

"Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18"

My desired match is:

"Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18"

To oversimplify a bit, the current regex finds the text between a period and a page reference such as ", p. 18", provided that text contains no semicolon or period.

I'd like to adjust the regex so that it permits a period when the period is preceded by a space and a capital letter. I'm aware that VBA doesn't have lookbehind functionality.

The VBA code to run the example I've given is as follows:

Dim exampleString As String
Dim re As Object, matches As Object
exampleString = "First there is a sentence or two, then a citation which I'd like to extract. Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18."
Set re = CreateObject("VBScript.RegExp")
With re
    .Global = True
    .Pattern = "\b[^\.\;]+(,\s+p\.\s(\d+\-\d+|\d+))"
    Set matches = .Execute(exampleString)
End With
user1583016
  • Try this regex: [`(?:\.\s+(?=[A-Z]))([^;]+(?:,\s+p\.\s+\d+(?:\-\d+)?))`](https://regex101.com/r/fS7yR4/1). The value you need is inside submatches. If it works, I will post with code. – Wiktor Stribiżew Jan 26 '16 at 22:11
  • That works! Thank you! In your answer, would you mind walking me through how it works, or linking to relevant explanations on http://www.regular-expressions.info/? This was just a simplified example, and I need to fit these changes into a much larger, more complex regular expression that handles other cases. – user1583016 Jan 26 '16 at 22:21
  • On second thought, the link you provided does walk me through most of it. – user1583016 Jan 26 '16 at 22:25

1 Answer

Here is a sample VBA sub that can get what you need:

Sub Test1()
    Dim str As String
    Dim objRegExp As Object
    Dim objMatches As Object
    str = "First there is a sentence or two, then a citation which I'd like to extract. Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18."
    Set objRegExp = CreateObject("VBScript.RegExp") ' Create the RegExp object (late binding)
    objRegExp.Pattern = "(?:\.\s+(?=[A-Z]))([^;]+(?:,\s+p\.\s+\d+(?:-\d+)?))" ' Set pattern
    Set objMatches = objRegExp.Execute(str)  ' Execute the regex match
    If objMatches.Count <> 0 Then            ' Check if there are any items in the result
        Debug.Print objMatches.Item(0).SubMatches.Item(0) ' Print Match 1, Submatch 1
        ' > Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18
    End If
End Sub

The pattern is

(?:\.\s+(?=[A-Z]))([^;]+(?:,\s+p\.\s+\d+(?:-\d+)?))

See demo

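The main addition to yours is the leading (?:\.\s+(?=[A-Z])) subpattern. It matches a . followed by one or more whitespace characters (\s+), which must in turn be followed by an uppercase letter. That letter is NOT consumed; it is only checked by the positive lookahead (?=[A-Z]). Because the leading period and whitespace are consumed by the match, the citation itself is read from the first capturing group (SubMatches) rather than from the whole match. I also merged (\d+\-\d+|\d+) into \d+(?:-\d+)?.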
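
If the text holds more than one citation, the greedy [^;]+ can run past the first page reference to a later one, and p\. does not accept "pp.". Below is a minimal sketch of one possible adjustment, not necessarily the fix the asker arrived at. It assumes a lazy [^;]+?, uses p+\. so that "pp." is accepted, and sets Global to True. The sub name Test2, the variable names, and the layout of the two-citation sample string are illustrative.

```vba
Sub Test2()
    Dim sampleText As String
    Dim objRegExp As Object
    Dim objMatches As Object
    Dim objMatch As Object

    ' Two citations in a row; the second one is the example from the comments (assumed layout)
    sampleText = "First there is a sentence or two, then a citation which I'd like to extract. " & _
                 "Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18. " & _
                 "Another, Person I., Another Title, 20th Edition, Publishing Company, US, 2016, pp. 19-20."

    Set objRegExp = CreateObject("VBScript.RegExp")
    objRegExp.Global = True   ' Return every match, not only the first one
    ' Lazy [^;]+? stops at the nearest page reference; p+ accepts both "p." and "pp."
    objRegExp.Pattern = "\.\s+(?=[A-Z])([^;]+?,\s+p+\.\s+\d+(?:-\d+)?)"

    Set objMatches = objRegExp.Execute(sampleText)
    For Each objMatch In objMatches
        Debug.Print objMatch.SubMatches(0)   ' The citation is in the first capturing group
    Next objMatch
    ' > Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18
    ' > Another, Person I., Another Title, 20th Edition, Publishing Company, US, 2016, pp. 19-20
End Sub
```

The lazy quantifier only helps because IgnoreCase is left at its default of False, so ", Publishing" is not mistaken for the start of a page reference.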

Wiktor Stribiżew
  • After trying this on my actual text I've noticed that it doesn't work when there are [two citations](https://regex101.com/r/vT5gP9/1). How can I get match 1 = "Bookwriter, Johnny J., Book Title, 50th Edition, Publishing Company, United States, 2016, p. 18" and match 2 = "Another, Person I., Another Title, 20th Edition, Publishing Company, US, 2016, pp. 19-20"? – user1583016 Jan 27 '16 at 00:14
  • I was actually able to solve that one on my own with [this](https://regex101.com/r/fS7yR4/2) in case anyone has the same issue. – user1583016 Jan 27 '16 at 00:28
  • The Global RegExp property should be set to True to find several matches. However, this option works when you intend to get fully non-overlapping matches (those that do not start with the same letter) and for greedy dot matching patterns. Your regex should work fine since it is based on a negated character class. – Wiktor Stribiżew Jan 27 '16 at 06:50