1

Using the following text as a sample, I need to extract the text between LI tags. Notice that the first LI is intentionally malformed (it has no closing tag), because real input may look like this. Said another way, I want everything from an LI tag to either its closing LI tag or the next opening LI tag.

    <UL>
    <LI class="test">This is the first ListItem Text. 
    <LI>This is the second ListItem Test. </LI></UL>

So far I have come up with:

    <[Ll][Ii].*>(.*?)((?:<[Ll][Ii]>)|(?:</[Ll][Ii]>))

But this appears to match from the first LI tag all the way to the closing tag as a single match, with the group containing only the text of the second LI. I've managed to get it to return the first item, but never both. I'm using the "Dot matches newline" option as well, and I need this to work in .NET. Thanks!

UPDATE

I did some research before posting this question, and I understand that using regexes to parse HTML is a bad idea. That being said, I only need to get the text from a couple of LI tags here and there to determine what text to bulletize on a PowerPoint slide. I thought there might be a simpler way to do it than dealing with a separate library, especially since third-party libraries are tricky to deal with where I work. Unfortunately, the HTML can end up malformed in certain situations when it comes from an HTML rich-text entry box on a page that lets you bulletize text. Thanks for all of the recommendations against using regex to parse HTML. I should have specified up front that I had already read a lot of similar advice and was looking for a quick workaround for a simple set of circumstances.

Tom

6 Answers

5

If this is a recurring scenario, I would use an HTML parser instead. Parsing HTML with a regex can take a tremendous amount of development time and may still turn out buggy because of the malformed input you mentioned.

Here's one I found with a basic Google search:
http://www.netomatix.com/products/Documentmanagement/HtmlParserNet.aspx

UPDATE:

Here are some related posts on StackOverflow:
How do you parse a poorly formatted HTML file?
What is the best way to parse html in C#?
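
For what it's worth, here is a minimal sketch of the parser approach. It is written in Python with the standard-library `html.parser` module purely for illustration (it is not the .NET parser linked above), and `ListItemCollector` and `extract_list_items` are hypothetical names. It assumes flat, non-nested lists and treats a new `<li>` as implicitly closing the previous one.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text of each <li>, tolerating missing </li> tags."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished list-item texts
        self._current = None  # text chunks of the open <li>, or None

    def _close_item(self):
        # Finish the open item, if there is one.
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so "LI" arrives as "li".
        if tag == "li":
            self._close_item()  # a new <li> implicitly ends the previous one
            self._current = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._close_item()

    def handle_data(self, data):
        # Only keep text that appears inside an open <li>.
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._close_item()  # input ended while an item was still open


def extract_list_items(html):
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items
```

On the sample from the question, `extract_list_items` should return both item texts even though the first `</LI>` is missing. Nested lists would need extra bookkeeping that this sketch does not attempt.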

Slavo
  • While not exactly the solution/route I wanted to have to take for this, I recognize that it really is the RIGHT answer. Thanks. – Tom May 27 '09 at 12:36
1

As Slavo mentioned, this is difficult. The example you give is particularly tricky because the second "<LI>" needs to be treated as both the closing tag of the first match and the opening tag of the second. Once a pattern consumes that tag as the end of the first match, it is no longer available to start the second.

On an unrelated note, you can make the regex case-insensitive with a flag (RegexOptions.IgnoreCase in .NET), so that you don't have to write [Ll][Ii], etc.
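
To illustrate both points, here is a small sketch in Python (chosen only for illustration; the pattern syntax is the same for this case, and in .NET the corresponding flags would be `RegexOptions.IgnoreCase` and `RegexOptions.Singleline`). The names `CONSUMING`, `LOOKAHEAD`, and `item_texts` are made up for the example. The first pattern consumes the terminating tag; the second only peeks at it with a lookahead, which is one way to let the same `<LI>` end one match and begin the next.

```python
import re

SAMPLE = (
    '<UL>\n'
    '<LI class="test">This is the first ListItem Text. \n'
    '<LI>This is the second ListItem Test. </LI></UL>'
)

# IGNORECASE replaces [Ll][Ii]; DOTALL is Python's "dot matches newline".
FLAGS = re.IGNORECASE | re.DOTALL

# The terminator is part of the match, so the second <LI> is used up
# and cannot open a second match.
CONSUMING = re.compile(r"<li\b[^>]*>(.*?)(?:<li\b[^>]*>|</li>)", FLAGS)

# The terminator is only peeked at, so the second <LI> is still
# available to start the next match.
LOOKAHEAD = re.compile(r"<li\b[^>]*>(.*?)(?=<li\b[^>]*>|</li>)", FLAGS)


def item_texts(pattern, html):
    """Return the stripped text captured by group 1 of every match."""
    return [m.group(1).strip() for m in pattern.finditer(html)]
```

With the question's sample, `item_texts(CONSUMING, SAMPLE)` should yield only the first item, while `item_texts(LOOKAHEAD, SAMPLE)` should yield both.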

Chad Birch
1

Try this.

    <li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)

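Note that you need to use the RegexOptions.IgnoreCase option for this to work (together with the "Dot matches newline" option you are already using, since the item text can span lines), but it makes your expression much more readable.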
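
As a quick sanity check, the pattern can be exercised like this. The sketch is in Python rather than .NET, only for illustration; `re.IGNORECASE | re.DOTALL` stands in for `RegexOptions.IgnoreCase | RegexOptions.Singleline`, and `LI_TEXT` and `list_item_texts` are hypothetical names.

```python
import re

# The pattern from above, unchanged. The lookahead stops each item at
# </li>, the next <li ...>, </ul>, or the end of the input (\Z) without
# consuming that terminator.
LI_TEXT = re.compile(
    r"<li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def list_item_texts(html):
    """Return the trimmed text of every list item found in html."""
    return [text.strip() for text in LI_TEXT.findall(html)]


sample = (
    '<UL>\n'
    '<LI class="test">This is the first ListItem Text. \n'
    '<LI>This is the second ListItem Test. </LI></UL>'
)
items = list_item_texts(sample)

# Input that ends without any closing tag still yields the last item,
# because \Z matches at the end of the string.
truncated = list_item_texts("<li>one<li>two")
```

One caveat under these assumptions: `<li.*?>` would also match other tags whose names begin with "li", such as `<link>`, so a tighter opener like `<li\b[^>]*>` may be safer if such tags can appear in the input.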

harpo
  • This will break if both </li> and </ul> are missing. – Tomalak Apr 21 '09 at 15:14
  • @Tomalak: It should also pick up text to the next <li> tag, as requested, and even the rest of the string if there's no more <li>, </li> or </ul> tags. Looks exactly what the question asked for. – Whatsit Apr 21 '09 at 15:17
  • @Whatsit: I don't recognize the requirement to match up to the end of the input in the question. Where does the OP say that? – Tomalak Apr 21 '09 at 15:21
  • @Tomalak: They didn't, so I suppose technically it's not *exactly* what they asked for, but I'd expect this is what they *want* – Whatsit Apr 21 '09 at 15:24