Use REGEX to find Contents of HTML ListItem (.NET)

Question

Using the following text as a sample, I need to be able to extract the text between LI tags. Notice that the first LI is intentionally malformed, as this may be the case in real input. Said another way, I want everything from an LI tag to either its closing LI tag or the next opening LI tag.

    <UL>
    <LI class="test">This is the first ListItem Text. 
    <LI>This is the second ListItem Test. </LI></UL>

So far I have come up with:

<[Ll][Ii].*>(.*?)((?:<[Ll][Ii]>)|(?:</[Ll][Ii]>))

But this appears to match from the first LI tag to the closing tag as a single match, with the group being the text of the 2nd LI tag. I've managed to get it to return the first set, but never both. I'm also using the "Dot matches newline" option, and I need this to work in .NET. Thanks!

UPDATE

I had done some research prior to posting this question and did in fact see and understand that using regexes to parse HTML is a bad idea. That being said, I only need to get the text from a couple of LI tags here and there to determine what text to bulletize on a PowerPoint slide. I thought there might be a simpler way to do it than dealing with a separate library, especially since using third-party libraries is tricky where I work. Unfortunately, it appears that the HTML can end up malformed in certain situations when using an HTML rich text entry box on a page that allows you to bulletize text. Thanks for all of the recommendations against using REGEX for parsing HTML. I should have specified up front that I have read a lot of similar advice already but was looking for a quick workaround for a simple set of circumstances.

score 5 · Accepted Answer · edited May 23 '17 at 10:32

If this is a recurring scenario, I would rather use an HTML parser. Parsing HTML with Regex will take a tremendous amount of time, and might still turn out buggy, because of malformed input (that you mentioned).

Here's one I found with a basic Google search:
http://www.netomatix.com/products/Documentmanagement/HtmlParserNet.aspx

UPDATE:

Here are some related posts on StackOverflow:
How do you parse a poorly formatted HTML file?
What is the best way to parse html in C#?


answered Apr 21 '09 at 15:00

Slavo


While not exactly the solution/route I wanted to have to take for this, I recognize that it really is the RIGHT answer. Thanks. – Tom May 27 '09 at 12:36

score 1 · Answer 2 · answered Apr 21 '09 at 15:01

As Slavo mentioned, this is difficult. The example you give is particularly tricky because the second "<LI>" needs to be treated as both the closing tag of the first match, and the opening tag of the second. This is hard.

On a totally unrelated note, you can set regex flags to be case insensitive, so that you don't have to do [Ll][Ii], etc.
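
As a minimal sketch of that point in C# (the class name `CaseInsensitiveDemo` is made up for illustration), either `RegexOptions.IgnoreCase` or the inline `(?i)` modifier lets a lowercase pattern match uppercase tags:

```csharp
using System;
using System.Text.RegularExpressions;

static class CaseInsensitiveDemo
{
    static void Main()
    {
        string tag = "<LI>";

        // Matching is case-sensitive by default, so this prints False.
        Console.WriteLine(Regex.IsMatch(tag, "<li>"));

        // The IgnoreCase option makes "li" match "LI": prints True.
        Console.WriteLine(Regex.IsMatch(tag, "<li>", RegexOptions.IgnoreCase));

        // The inline (?i) modifier has the same effect: prints True.
        Console.WriteLine(Regex.IsMatch(tag, "(?i)<li>"));
    }
}
```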

score 1 · Answer 3 · edited Apr 21 '09 at 15:20

Try this.

<li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)

Note that you need to use the RegexOptions.IgnoreCase option for this to work, but it makes your expression much more readable.
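
A rough sketch of how this pattern might be applied in C#, assuming the sample from the question as input. The class and method names (`ListItemLookahead`, `Extract`) are hypothetical, and `RegexOptions.Singleline` is added on the assumption that it corresponds to the "Dot matches newline" option the question mentions:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ListItemLookahead
{
    // Pattern from the answer above, unchanged.
    const string Pattern = @"<li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)";

    public static List<string> Extract(string html)
    {
        var items = new List<string>();

        // IgnoreCase lets "li" match "LI"; Singleline lets "." match newlines.
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        foreach (Match m in Regex.Matches(html, Pattern, options))
        {
            // Group 1 is the item text; trim surrounding spaces and newlines.
            items.Add(m.Groups[1].Value.Trim());
        }
        return items;
    }

    static void Main()
    {
        string html =
            "    <UL>\n" +
            "<LI class=\"test\">This is the first ListItem Text. \n" +
            "<LI>This is the second ListItem Test. </LI></UL>";

        foreach (string item in Extract(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

Because the terminator sits inside a lookahead, the second `<LI>` is not consumed by the first match, so it is still available as the opening tag of the second. Note that `<li.*?>` would also match any other tag whose name starts with "li", such as `<link>`.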

answered Apr 21 '09 at 15:03

harpo


This will break if both and are missing. – Tomalak Apr 21 '09 at 15:14
@Tomalak: It should also pick up text to the next tag, as requested, and even the rest of the string if there's no more or tags. Looks exactly what the question asked for. – Whatsit Apr 21 '09 at 15:17

@Whatsit: I don't recognize the requirement to match up to the end of the input in the question. Where does the OP say that? – Tomalak Apr 21 '09 at 15:21

@Tomalak: They didn't, so I suppose technically it's not *exactly* what they asked for, but I'd expect this is what they *want* – Whatsit Apr 21 '09 at 15:24

score 1 · Answer 4 · answered Apr 21 '09 at 15:04

I feel like a broken vinyl record, but: don't use regular expressions to parse non-regular languages.

There are tons of .NET HTML parsers available, some of which can also correct malformed HTML. I googled ".net html parser malformed" and there seem to be some promising results.

score 1 · Answer 5 · edited May 23 '17 at 12:10

Regexes are bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser like Html Agility Pack.


answered Apr 21 '09 at 15:05

Chas. Owens


score 0 · Answer 6 · answered Apr 21 '09 at 15:03

If your input is reasonably valid (and the list items contain text only), you might get away with:

<li[^>]*>([^<]*)

Apply as global/case insensitive and look for the contents of match group 1.

The result will need some normalization (trimming, replacing newlines).
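
A minimal C# sketch of those steps, under the same "text only" assumption. The names `ListItemSimple` and `Extract` are invented for this example, and collapsing whitespace with `\s+` is just one possible way to do the normalization:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ListItemSimple
{
    public static List<string> Extract(string html)
    {
        var items = new List<string>();

        // Regex.Matches returns every match ("global"); IgnoreCase handles <LI> vs <li>.
        foreach (Match m in Regex.Matches(html, @"<li[^>]*>([^<]*)", RegexOptions.IgnoreCase))
        {
            // Normalize: collapse whitespace runs (including newlines) to one space, then trim.
            string text = Regex.Replace(m.Groups[1].Value, @"\s+", " ").Trim();
            items.Add(text);
        }
        return items;
    }

    static void Main()
    {
        string html =
            "    <UL>\n" +
            "<LI class=\"test\">This is the first ListItem Text. \n" +
            "<LI>This is the second ListItem Test. </LI></UL>";

        foreach (string item in Extract(html))
        {
            Console.WriteLine("[" + item + "]");
        }
    }
}
```

Since `[^<]*` stops at the first `<`, an item containing nested markup such as `<b>` would be cut off at that tag, which is why the "text only" condition matters.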

answered Apr 21 '09 at 15:03

Tomalak


Nevertheless - Regex is bad for HTML parsing, like some of the others said. This is why I said "might get away with". – Tomalak Apr 21 '09 at 15:10
