
I have the following string:

In order to take this course, you must:<br>
<br>
&radic; &nbsp; &nbsp;Have access to a computer.<br>
<br>
&radic; &nbsp; &nbsp;Have continuous broadband Internet access.<br>
<br>
&radic; &nbsp; &nbsp;Have the ability/permission to install plug-ins (e.g. Adobe Reader or Flash) and software.<br>
<br>
&radic; &nbsp; &nbsp;Have the ability to download and save files and documents to a computer.<br>
<br>
&radic; &nbsp; &nbsp;Have the ability to open Microsoft file and documents (.doc, .ppt, .xls, etc.).<br>
<br>
&radic; &nbsp; &nbsp;Be competent in the English language.<br>
<br>
&radic; &nbsp; &nbsp;Have access to a relational database management system.&nbsp; A good open-source option is MySQL (<a href="http://dev.mysql.com" target="_blank">dev.mysql.com</a>).<br>
<br>
&radic; &nbsp; &nbsp;Have completed the Discrete Structures course.<br>
<br>
&radic;&nbsp;&nbsp;&nbsp; Have read the Student Handbook.

I'm trying to select the text in the middle (excluding the title, encoded spaces and <br>s), for instance, the first match should be: Have access to a computer.

I've tried the following two, but can't make it work.

This one selects the entire line: ^(?:&radic;([(&nbsp;)|\s]*))(.*)(?:(\<br\\?\>)*)$, I tried to call Regex.Matches(requirements.InnerHtml, RequirementsExtractorRegex, RegexOptions.Multiline)[0].Captures[0].Value, and here is the value: &radic; &nbsp; &nbsp;Have access to a computer.<br>.

And this one doesn't select anything: ^(?<=&radic;([(&nbsp;)|\s]*))(.*)(?=(\<br\\?\>)*)$

What am I doing wrong?

Shimmy Weitzhandler
  • You mean, what are you doing wrong in addition to using regular expressions to parse HTML? Surely, you've seen "[RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)"? – John Saunders Feb 27 '15 at 01:59

1 Answer

A slight modification of the regex produces (almost; see the caveat below) the desired result:

^(?:&radic;(?:&nbsp;|\s)*)(.*)(?:<br/?>)

Read the target text from capture group #1 (Groups[1]); Captures[0] on the match itself returns the entire matched text:

Regex.Matches(requirements.InnerHtml, RequirementsExtractorRegex, RegexOptions.Multiline)[0].Groups[1].Value

Tested on Regex Storm with the Multiline option enabled.

Caveat

The regex matches every targeted occurrence except the last, because the br element is not optional and the last line has none. Making that group optional brings the last occurrence into the matches, but capture group #1 then contains the br element that terminates each line, because the greedy universal match (.*) consumes it first. Adding the end-of-line anchor ($) prevents a match altogether (though by my understanding of the specs it shouldn't; perhaps an artifact of the testing environment?).
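
One possible way around this caveat, offered as a sketch rather than as the approach tested above: make the capture lazy, make the br element optional, and allow trailing whitespace before the end-of-line anchor. In .NET, `$` under `RegexOptions.Multiline` matches only before `\n`, so a leftover `\r` from Windows line endings would otherwise block the anchor; that may or may not be what the anchored attempt ran into. The class name `RequirementsExtractor`, the method `Extract`, and the shortened sample input are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RequirementsExtractor
{
    // Lazy (.*?) stops before an optional trailing <br>, so the last line
    // (which has no <br>) matches too. \s* before $ absorbs a trailing \r.
    private static readonly Regex Pattern = new Regex(
        @"^&radic;(?:&nbsp;|\s)*(.*?)(?:<br\s*/?>)?\s*$",
        RegexOptions.Multiline);

    // Returns the text of each requirement line, without marker or <br>.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Shortened, hypothetical sample with Windows line endings.
        string html =
            "In order to take this course, you must:<br>\r\n" +
            "<br>\r\n" +
            "&radic; &nbsp; &nbsp;Have access to a computer.<br>\r\n" +
            "<br>\r\n" +
            "&radic;&nbsp;&nbsp;&nbsp; Have read the Student Handbook.";

        foreach (string item in Extract(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

Because the capture is lazy, it grows only until the optional br element and the end of the line can match, so group #1 should hold just the requirement text for every line, including the last one.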

collapsar
  • It doesn't match on the last statement. I thought that `(?:&nbsp;|\s)*` stands for *either `&nbsp;` or a whitespace, zero or more times, order doesn't matter*, isn't that so? What is then used to look for a choice of optional **words** repeated zero or more times? – Shimmy Weitzhandler Feb 27 '15 at 09:40
  • Both of your observations are correct and you got your syntax right. IMHO the problem is not the second non-capturing group but the third one: as it stands, it prevents matching of the last line; when qualifying it with `*`, the preceding greedy capturing group wins out in every match (i.e. includes `<br>`). I have no solution to this one (other than artificially appending a `<br>\n` to the original string).
    – collapsar Feb 27 '15 at 09:46
  • I tried to change the dot, but still doesn't work, see [this](http://regexr.com/3ags1) one. – Shimmy Weitzhandler Feb 27 '15 at 09:56
  • The ampersands of the HTML entities haven't made it into the regexr pattern line, and you have to include at least the `.` in your character class. If you do so and take `.` as a stand-in for the ampersand, you get `^(?:.radic;(?:.nbsp;|\s)*)([A-Za-z0-9.\s]*)(?:<br/?>)`, which matches 4 times.
    – collapsar Feb 27 '15 at 10:04
  • I ended up using my original query, excluding the final group, and replacing it manually from each resulting match. – Shimmy Weitzhandler Feb 27 '15 at 12:27
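
The workaround in the last comment might look roughly like the sketch below: match each line without the final group, then strip a trailing br element from every capture in a second step. This is an interpretation of that comment, not the poster's actual code. It uses the answer's `(?:&nbsp;|\s)*` prefix rather than the question's original character class, and the names `RequirementsPostProcessor`, `Extract`, `Line`, and `TrailingBreak` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RequirementsPostProcessor
{
    // Greedy capture of everything after the marker; the captured text
    // may still end with <br> and, on Windows line endings, a \r.
    private static readonly Regex Line = new Regex(
        @"^&radic;(?:&nbsp;|\s)*(.*)", RegexOptions.Multiline);

    // Optional trailing <br>, <br/> or <br /> plus trailing whitespace.
    private static readonly Regex TrailingBreak =
        new Regex(@"(?:<br\s*/?>)?\s*$");

    // Returns each requirement with the trailing break removed afterwards.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Line.Matches(html))
        {
            results.Add(TrailingBreak.Replace(m.Groups[1].Value, string.Empty));
        }
        return results;
    }

    public static void Main()
    {
        // Shortened, hypothetical sample input.
        string html =
            "&radic; &nbsp; &nbsp;Have access to a computer.<br>\r\n" +
            "<br>\r\n" +
            "&radic;&nbsp;&nbsp;&nbsp; Have read the Student Handbook.";

        foreach (string item in Extract(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

Splitting the work into two passes keeps each pattern simple: the first only has to find the lines, and the second only has to recognise what may trail the text.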