In a previous post I used a regular expression to pull requirements from an HTML document. My original assumptions were that a user would enter a set of requirements in their document and that each requirement would be a single sentence. The regular expression I was using was:
(?'Requirement'<requirement>.*\n?.*</requirement>)
I've since found out that authors are entering requirements in their documents in multiple ways. Some use unordered lists, some insert artificial carriage returns/line breaks for formatting, etc. Here is an example:
<requirement>A Report contains ratings of the following information elements as defined in
<a href="Criteria.html">Criteria</a>
</span>
<ul>
<li><span class="style2">Overall</span></li>
<li><span class="style2">Technical</span></li>
<li><span class="style2">Cost</span></li>
<li><span class="style2">Schedule</span></li>
<li><span class="style2">Customer/Quality</span></li>
<li><span class="style2">Supplier</span></li>
<li><span class="style2">Staffing</span></li>
<li><span class="style2">Performance</span></li>
</ul></requirement>
<requirement>
If the owner deviates from the criteria used in
<a href="Criteria.html">Criteria</a>, the specific rationale shall be documented on the
Report and color coded as Override (e.g., RO equals Red Override,
YO equals Yellow Override).
</requirement>
<requirement>
The justification is specifically documented as a “Override” in the Enhanced Report under the Other Tab and Report Comments.
</requirement>
<requirement>
This comment will be broken down as a Red, Yellow, Green (RYG) Override for each category that is overridden,
i.e., RYG Cost Override.
</requirement>
I've tried changing the regular expression to the following, which will match any requirement with up to 3 lines:
(?'Requirement'<requirement>.*\s?.*\s?.*\s?.*</requirement>)
However, changing it to the following causes 2 of the requirements to be matched as 1 requirement:
(?'Requirement'<requirement>.*\s?.*\s?.*\s?.*\s?.*</requirement>)
I know I can get the index of a match, so I thought I would create a routine that would use the following to get a starting position:
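Imports System.Text.RegularExpressions
Dim matchesreq As MatchCollection = Regex.Matches(stringReader, "(?'Requirement'<requirement>)")
For Each matchreq As Match In matchesreq
    Dim start_position As Integer = matchreq.Index ' index of the opening <requirement> tag
    ' ... find the closing tag from here ...
Next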
I thought I would then try to pass the index value into a regular expression to find the closing </requirement> tag. I could then use both indexes to extract the requirement from the string.
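Here is a rough sketch of what I have in mind, in case it makes the idea clearer. The names ExtractRequirements and html are just placeholders, and I am assuming the instance overload Regex.Match(input, startat) can be used to start the search for the closing tag at a given index. I have not tried this against the real documents:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RequirementSketch
    ' Hypothetical helper: returns the text between each <requirement> tag
    ' and the first </requirement> tag that follows it.
    Function ExtractRequirements(ByVal html As String) As List(Of String)
        Dim openTag As New Regex("<requirement>")
        Dim closeTag As New Regex("</requirement>")
        Dim results As New List(Of String)

        For Each matchreq As Match In openTag.Matches(html)
            ' Position of the first character after the opening tag.
            Dim startPosition As Integer = matchreq.Index + matchreq.Length

            ' Begin looking for the closing tag at that position.
            Dim closing As Match = closeTag.Match(html, startPosition)
            If Not closing.Success Then Exit For ' no closing tag left, so stop

            ' Everything between the two indexes is the requirement body.
            results.Add(html.Substring(startPosition, closing.Index - startPosition).Trim())
        Next

        Return results
    End Function

    Sub Main()
        ' Small made-up input with a line break inside the first requirement.
        Dim sample As String = "<requirement>First line" & Environment.NewLine &
                               "second line</requirement>" & Environment.NewLine &
                               "<requirement>Another one</requirement>"

        For Each req As String In ExtractRequirements(sample)
            Console.WriteLine("[" & req & "]")
        Next
    End Sub
End Module
```

Because the closing tag is searched for from each opening tag's position, the number of lines in between should not matter, which is the part my line-counting patterns above cannot handle.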
Can this be done, and are there any other thoughts or suggestions?