I am trying to use a regex to extract paragraphs from a document. Each paragraph is preceded and followed by a '-' on a separate line, and each paragraph starts with a number.

For example

-
1. This is a paragraph
It may go over multiple lines
-

Ideally, I would like to exclude the '-', but it doesn't really matter, as I will be placing the result in a string and running another regex against it (one that I know works).

The code I am trying to use is basically as follows:

Dim matchPara As String
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String

matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True

fileName = "C:\file.txt"
fileNo = FreeFile

Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)

For Each theMatch in matches
    MsgBox(theMatch.Value)
Next theMatch

Close #fileNo

I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping:

-?\d.*?-

However, when I run the code, theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text, but never more than the first line.

I have checked the length of theMatch.Value with:

MsgBox(len(theMatch.Value))

and placed the contents of theMatch.Value in a cell on the worksheet to see if it was being cut off in the message box, but both theories proved wrong.

I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.

The paragraphs contain data that I am trying to extract. The idea was to pull each paragraph out with a regex, place it in a string, and then run other regexes against it to get the information I need. Some paragraphs won't contain the data I need, so the plan was to loop through each individual paragraph and handle errors gracefully when the data isn't there (i.e. get what I can and drop the rest with an error message).

Here is a screenshot:

[image: regex101 screenshot]

halfer
Nasica
  • I don't think it's working properly on **regex101**. Could you post a screenshot of the results from _regex101_ showing that it's returning your full paragraphs there? Also, is there a reason this needs to be done with regular expressions? And what are you actually doing with the output? (I assume the goal isn't just to show them in `MsgBox`es?) – ashleedawg Dec 23 '17 at 14:06
  • Added screenshot and answers to your question – Nasica Dec 23 '17 at 14:18
  • The RegEx in your screenshot is not the same as in your code. – ashleedawg Dec 23 '17 at 14:20
  • No, but it's the same as the second example I gave that also doesn't work – Nasica Dec 23 '17 at 14:22
  • Check this answer: [Use Regex to Split Numbered List array into Numbered List Multiline](https://stackoverflow.com/q/46331543/7690982) – danieltakeshi Jan 09 '18 at 12:07

3 Answers

This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:

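Sub paragraph_no_regex()
    Dim s As String
    Dim ary As Variant
    Dim a As Variant
    Dim i As Long

    ' Join all constant cells in column A into one string
    ' (2 = xlCellTypeConstants; TextJoin needs Excel 2019 or later, or Microsoft 365).
    With Application.WorksheetFunction
        s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
    End With

    ' Split on the "-" separators and write each piece to column B.
    ' Note: this also splits on any "-" inside a paragraph.
    ary = Split(s, "-")
    i = 1
    For Each a In ary
        Cells(i, 2) = a
        i = i + 1
    Next a
End Sub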


Gary's Student

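Sub F()

    ' Early binding: needs a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim re As New RegExp
    Dim sMatch As String
    Dim document As String

    ' "." does not match a newline, so (.|\n) is used to span lines;
    ' "+?" is lazy, so the match stops at the first closing "-" line
    re.Pattern = "-\n((.|\n)+?)\n-"

    'Getting document
    document = ...

    ' First match, first capture group (raises an error if nothing matched)
    sMatch = re.Execute(document)(0).SubMatches(0)

End Sub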

If you need the dashes, just include them in the capture group (the outer parentheses).
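One thing that may matter when the text is read from a Windows file: lines often end in CR+LF, and `\n` matches only the LF, so `-\n` will not match a `-` that is followed by CR+LF. The following is a minimal sketch, assuming that is the case for the file in question, that normalises line endings before the regex runs. `NormalizeLineEndings` is a hypothetical helper name, not part of VBA or of the code above.

```vba
' Hypothetical helper: converts CR+LF and lone CR line endings to LF,
' so that "\n" in a pattern lines up with every line break.
Function NormalizeLineEndings(ByVal s As String) As String
    ' Replace CR+LF pairs first so they do not become two line breaks.
    s = Replace(s, vbCrLf, vbLf)
    ' Then convert any remaining lone CR characters.
    NormalizeLineEndings = Replace(s, vbCr, vbLf)
End Function
```

It would be called once on the loaded text, for example `document = NormalizeLineEndings(document)`, before passing `document` to `Execute`.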

JohnyL

This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):

matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"

It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.

As an added benefit, leading and trailing whitespace is trimmed ("\s*" outside the group).
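To show how this pattern might be used from VBA, here is a minimal sketch that loops over all matches and prints the captured paragraph text. The sample string is made up for illustration and assumes LF-only line endings and a separate pair of '-' lines around each paragraph; the sub and variable names are hypothetical.

```vba
Sub ExtractParagraphsSketch()
    Dim regex As Object
    Dim matches As Object
    Dim m As Object
    Dim sample As String

    ' Made-up sample text: two paragraphs, each wrapped in its own "-" lines.
    sample = "-" & vbLf & "1. This is a paragraph" & vbLf & _
             "It may go over multiple lines" & vbLf & "-" & vbLf & _
             "-" & vbLf & "2. Second paragraph" & vbLf & "-"

    ' Late binding, so no library reference is required.
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
    regex.Global = True   ' find every paragraph, not just the first

    Set matches = regex.Execute(sample)
    For Each m In matches
        ' SubMatches(0) is the paragraph text without the dashes and the leading number.
        Debug.Print m.SubMatches(0)
    Next m
End Sub
```

Because the pattern consumes the closing '-', a single '-' line shared between two consecutive paragraphs would cause every second paragraph to be skipped. If the file is laid out that way, replacing the trailing `\n-` with a lookahead, `(?=\n-)`, is one option worth trying, since a lookahead does not consume the characters it checks.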

user1016274