Setting the Starting Position to Begin Regular Expression Searching

Question

In a previous post I used a regular expression to pull requirements from an HTML document. My original assumptions were that a user would enter a set of requirements in their document and that each requirement would be a single sentence. The regular expression I was using was:

(?'Requirement'<requirement>.*\n?.*</requirement>)

I've since found out that authors enter requirements in their documents in multiple ways. Some use unordered lists, some artificially insert carriage returns/line breaks for formatting, etc. Here is an example:

<requirement>A Report contains ratings of the following information elements as defined in 
<a href="Criteria.html">Criteria</a>

</span> 
<ul>
                <li><span class="style2">Overall</span></li>
                <li><span class="style2">Technical</span></li>
                <li><span class="style2">Cost</span></li>
                <li><span class="style2">Schedule</span></li>
                <li><span class="style2">Customer/Quality</span></li>
                <li><span class="style2">Supplier</span></li>
                <li><span class="style2">Staffing</span></li>
                <li><span class="style2">Performance</span></li>
</ul></requirement>

<requirement>
If the owner deviates from the criteria used in
<a href="Criteria.html">Criteria</a>, the specific rationale shall be documented on the 
Report and color coded as Override (e.g., RO equals Red Override, 
YO equals Yellow Override).
</requirement>

                <requirement>
                The justification is specifically documented as a “Override” in the Enhanced Report under the Other Tab and Report Comments.
                </requirement>

                <requirement>
                This comment will be broken down as a Red, Yellow, Green (RYG) Override for each category that is overridden, 
                i.e., RYG Cost Override.
                </requirement>

I've tried changing the regular expression to the following, which will match any requirement with up to 3 lines:

(?'Requirement'<requirement>.*\s?.*\s?.*\s?.*</requirement>)

However, changing it to the following results in 2 of the requirements being matched as 1 requirement:

(?'Requirement'<requirement>.*\s?.*\s?.*\s?.*\s?.*</requirement>)

I know I can get the index of a match, so I thought I would create a routine that would use the following to get a starting position:

Dim matchesreq As MatchCollection = Regex.Matches(stringReader, "(?'Requirement'<requirement>)")
For Each matchreq As Match In matchesreq
    ' Position of the opening <requirement> tag in the document
    start_position = matchreq.Index
Next

I thought I would then try to pass the index value into a regular expression to find the ending </requirement> tag. I could then use both indexes to parse the string and extract the requirement.

Can it be done and/or are there any thoughts/suggestions?
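
For reference, .NET does let a search begin at a chosen offset: the instance method Regex.Match(String, Int32) starts scanning at the given character position. Below is a minimal sketch of the two-index idea described above. The module name, the NextRequirement helper, and the sample input are hypothetical and not taken from the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module StartPositionSketch
    ' Hypothetical helper: returns the text between the first <requirement>
    ' tag found at or after startAt and its closing tag, or Nothing when no
    ' complete pair remains. startAt is advanced past the closing tag.
    Function NextRequirement(html As String, ByRef startAt As Integer) As String
        Dim openTag As New Regex("<requirement>")
        Dim closeTag As New Regex("</requirement>")

        ' Regex.Match(String, Int32) begins the search at the given index.
        Dim openMatch As Match = openTag.Match(html, startAt)
        If Not openMatch.Success Then Return Nothing

        Dim bodyStart As Integer = openMatch.Index + openMatch.Length
        Dim closeMatch As Match = closeTag.Match(html, bodyStart)
        If Not closeMatch.Success Then Return Nothing

        startAt = closeMatch.Index + closeMatch.Length
        Return html.Substring(bodyStart, closeMatch.Index - bodyStart)
    End Function

    Sub Main()
        Dim html As String = "<requirement>A" & vbLf & "B</requirement> x <requirement>C</requirement>"
        Dim pos As Integer = 0
        Dim req As String = NextRequirement(html, pos)
        While req IsNot Nothing
            Console.WriteLine(req.Replace(vbLf, " "))
            req = NextRequirement(html, pos)
        End While
    End Sub
End Module
```

Because each search resumes after the previous closing tag, a requirement can span any number of lines without two requirements being merged into one.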

Feels like VB.NET. What stops you from using an HTML-specific library, e.g. HtmlAgilityPack? — Wiktor Stribiżew, Jun 24 '15 at 15:08
are you only trying to match requirements with 3 lines? Or are you trying to get all text between all requirement tags? — jmrah, Jun 24 '15 at 15:09
Please see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. — John Saunders, Jun 24 '15 at 15:17
Have you tried something like this ? `(?s)(?'Requirement'.*?)` or `(?'Requirement'[\S\s]*?)` or `(?s)(?'Requirement'(?:(?!?requirement>).)*?)` — , Jun 24 '15 at 16:05
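
The patterns in the last comment appear to have lost their <requirement> tags when the page was rendered, so the following is only a guess at the general idea: a lazy quantifier combined with single-line mode, so that . also matches line breaks and each match stops at the first closing tag. The pattern, the module, and the ExtractRequirements function are assumptions, not the commenter's exact code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LazyMatchSketch
    ' Hypothetical helper: returns the raw text of every requirement block.
    Function ExtractRequirements(html As String) As List(Of String)
        ' .*? is lazy, so it stops at the first </requirement> it can reach;
        ' RegexOptions.Singleline lets . match line breaks as well.
        Dim pattern As String = "<requirement>(?<Requirement>.*?)</requirement>"
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.Singleline)
            results.Add(m.Groups("Requirement").Value)
        Next
        Return results
    End Function
End Module
```

With a greedy .* the same pattern would run from the first opening tag to the last closing tag in the document, which is the merging behaviour described in the question.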

score 0 · Accepted Answer · answered Jun 24 '15 at 20:23

I ended up using a combination of regular expressions and .NET string matching.

Dim matchesreq As MatchCollection = Regex.Matches(stringReader, "(?'Requirement'<requirement>)")
For Each matchreq As Match In matchesreq
    ' The requirement text starts right after the opening tag.
    startpos = matchreq.Index + matchreq.Length
    endpos = stringReader.IndexOf("</requirement>", startpos)
    ' IndexOf returns -1 when no closing tag follows; stop instead of letting Substring throw.
    If endpos < 0 Then Exit For
    reqstring = stringReader.Substring(startpos, endpos - startpos)

    ' Turn line breaks and tabs into spaces, then strip double spaces (mostly indentation).
    reqstring = reqstring.Replace(vbCrLf, " ")
    reqstring = reqstring.Replace(vbCr, " ")
    reqstring = reqstring.Replace(vbLf, " ")
    reqstring = reqstring.Replace(vbTab, " ")
    reqstring = reqstring.Replace("  ", "")

    ' First pass: remove every tag that does not begin with "<l".
    ' Closing tags such as </li> are removed; opening <li> tags are kept.
    Dim matchesbad As MatchCollection = Regex.Matches(reqstring, "(?'baddata'<(?!l).*?>)")
    For Each matchbad As Match In matchesbad
        Dim badgroups As GroupCollection = matchbad.Groups
        thebaddata = badgroups("baddata").ToString
        reqstring = reqstring.Replace(thebaddata, "")
    Next

    ' Second pass: replace the remaining tags (the <li> tags) with line feeds.
    matchesbad = Regex.Matches(reqstring, "(?'baddata'<.*?>)")
    For Each matchbad As Match In matchesbad
        Dim badgroups As GroupCollection = matchbad.Groups
        thebaddata = badgroups("baddata").ToString
        reqstring = reqstring.Replace(thebaddata, vbLf)
    Next

    ' Quote the cleaned requirement and add it to the list view with the page metadata.
    theRequirement = """" & reqstring.Trim & """"
    TempStr(0) = themfo
    TempStr(1) = thefunc_org
    TempStr(2) = thedisciplne
    TempStr(3) = oldTopicTitle
    TempStr(4) = theRevision
    TempStr(5) = thePageTitle
    TempStr(6) = theRelDate
    TempStr(7) = theRequirement
    TempNode = New ListViewItem(TempStr)
    lv1.Items.Add(TempNode)
Next

The first thing I do is match every opening <requirement> tag. I then pass each match's index to IndexOf to find the position of the closing tag, and populate a string with the characters between the two positions. Next I strip every tag EXCEPT the <li> tags from the string. Lastly, I replace the <li> tags with line feeds.
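
The cleanup steps can also be sketched as a small stand-alone function using Regex.Replace. This is a variation under stated assumptions, not the code above: the CleanRequirement name is hypothetical, whitespace runs are collapsed to a single space rather than removed, and the tag test is tightened from "does not start with l" to "is not an opening li tag".

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module RequirementCleanupSketch
    ' Hypothetical helper: turns the raw text between the requirement tags
    ' into plain text in which each list item starts on its own line.
    Function CleanRequirement(raw As String) As String
        ' Collapse every run of whitespace (line breaks, tabs, indentation) to one space.
        Dim text As String = Regex.Replace(raw, "\s+", " ")
        ' Remove every tag except an opening <li>; </li>, <ul>, <span>, <a> and so on go away.
        text = Regex.Replace(text, "<(?!li\b)[^>]*>", "")
        ' Turn each remaining <li> into a line feed.
        text = Regex.Replace(text, "<li\b[^>]*>", vbLf)
        Return text.Trim()
    End Function
End Module
```

Collapsing whitespace to one space keeps words on adjacent source lines from running together once the tags between them are removed.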
