1

I have the following code, which attempts to extract the content of li tags.

        string blah = @"<ul>
        <li>foo</li>
        <li>bar</li>
        <li>oof</li>
        </ul>";

        string liRegexString = @"(?:.)*?<li>(.*?)<\/li>(?:.?)*";
        Regex liRegex = new Regex(liRegexString, RegexOptions.Multiline);
        Match liMatches = liRegex.Match(blah);
        if (liMatches.Success)
        {
            foreach (var group in liMatches.Groups)
            {
                Console.WriteLine(group);
            }
        }
        Console.ReadLine();

The Regex started much simpler and without the multiline option, but I've been tweaking it to try to make it work.

I want results foo, bar and oof but instead I get <li>foo</li> and foo.

On top of this, it seems to work fine in Regex101: https://regex101.com/r/jY6rnz/1

Any thoughts?

Geesh_SO
  • 2
    You should not be trying to parse HTML with regex. HTML is not a regular language, so regular expressions will not handle it well. Use a standard HTML parsing method. – jdweng Aug 16 '17 at 09:29
  • 1
    https://stackoverflow.com/a/1732454/7931009 – Jakub Dąbek Aug 16 '17 at 10:03

2 Answers

3

I will start by saying that, as mentioned in the comments, you should be parsing HTML with a proper HTML parser such as HtmlAgilityPack. Moving on to actually answer your question, though...

The problem is that liRegex.Match(blah) only ever returns the first match, so you get a single result. What you want is liRegex.Matches(blah), which returns all matches.

So your use would be:

var liMatches = liRegex.Matches(blah);
foreach(Match match in liMatches)
{
    Console.WriteLine(match.Groups[1].Value);
}
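
As a rough sketch of the parser-based route recommended at the start of this answer: the snippet below does not use HtmlAgilityPack. It swaps in System.Xml.Linq from the standard library, which only works here because the sample markup happens to be well-formed XML. Real-world HTML often is not, and that is where a dedicated HTML parser would be the better fit. The class and method names (LiParserSketch, ExtractListItems) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LiParserSketch
{
    // Hypothetical helper: parses the markup as XML and returns the text of every <li>.
    // Assumption: the input is well-formed XML; XElement.Parse throws otherwise.
    public static List<string> ExtractListItems(string markup)
    {
        XElement root = XElement.Parse(markup);
        return root.Descendants("li")
                   .Select(li => li.Value)
                   .ToList();
    }

    public static void Main()
    {
        string blah = "<ul>\n<li>foo</li>\n<li>bar</li>\n<li>oof</li>\n</ul>";
        foreach (string item in ExtractListItems(blah))
        {
            Console.WriteLine(item); // foo, bar, oof on separate lines
        }
    }
}
```

No pattern is involved at all here, so nested tags or attributes on the li elements would not need any special handling.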
Chris
2

Your regex produces multiple matches when applied to blah. The Match method only returns the first match, which is the foo one. You are printing all groups in that first match, which gives you (1) the whole match and (2) group 1 of the match.
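
To make that concrete, here is a small sketch that lists the groups of the first match only. It uses the pattern from the question, but the input has no leading indentation, so the whole match carries no leading whitespace. FirstMatchDemo and GroupsOfFirstMatch are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FirstMatchDemo
{
    // Hypothetical helper: returns the value of every group in the FIRST match only,
    // which is what iterating over Match.Groups amounts to.
    public static List<string> GroupsOfFirstMatch(string input, string pattern)
    {
        var values = new List<string>();
        Match first = Regex.Match(input, pattern);
        if (first.Success)
        {
            foreach (Group g in first.Groups)
            {
                values.Add(g.Value);
            }
        }
        return values;
    }

    public static void Main()
    {
        string blah = "<ul>\n<li>foo</li>\n<li>bar</li>\n<li>oof</li>\n</ul>";
        string pattern = @"(?:.)*?<li>(.*?)<\/li>(?:.?)*";
        foreach (string value in GroupsOfFirstMatch(blah, pattern))
        {
            Console.WriteLine(value); // group 0: <li>foo</li>, then group 1: foo
        }
    }
}
```

Group 0 is always the whole match, which is why the tagged text shows up alongside foo.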

If you want to get foo, bar and oof, then you should print group 1 of each match. To do this, first get all the matches using Matches, then iterate over the MatchCollection and print Groups[1]:

string blah = @"<ul>
<li>foo</li>
<li>bar</li>
<li>oof</li>
</ul>";
string liRegexString = @"(?:.)*?<li>(.*?)<\/li>(?:.?)*";
Regex liRegex = new Regex(liRegexString, RegexOptions.Multiline);
MatchCollection liMatches = liRegex.Matches(blah);
// Cast<Match>() needs a "using System.Linq;" directive at the top of the file.
foreach (var match in liMatches.Cast<Match>())
{
    Console.WriteLine(match.Groups[1]);
}
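
Once Matches is used, the pattern can probably be trimmed down as well. The sketch below assumes each li element sits on a single line, as in the question. It drops the lazy prefix and suffix, because Matches already scans forward for each successive match, and it drops RegexOptions.Multiline, which only changes how ^ and $ behave and so has no effect on this pattern. LiExtractor and ExtractListItems are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LiExtractor
{
    // Hypothetical helper: collects capture group 1 from every match.
    // Assumption: an <li> never spans lines, since '.' does not match '\n' by default
    // (RegexOptions.Singleline would change that).
    public static List<string> ExtractListItems(string html)
    {
        var items = new List<string>();
        foreach (Match match in Regex.Matches(html, @"<li>(.*?)</li>"))
        {
            items.Add(match.Groups[1].Value);
        }
        return items;
    }

    public static void Main()
    {
        string blah = "<ul>\n<li>foo</li>\n<li>bar</li>\n<li>oof</li>\n</ul>";
        Console.WriteLine(string.Join(", ", ExtractListItems(blah))); // foo, bar, oof
    }
}
```

Declaring the loop variable as Match lets foreach do the cast, so Cast<Match>() is not required in this variant.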
Sweeper