
I have this text, and I am trying to print a1 and a2:

<a href="a1" title="t1"> k1 </a>
<a href="a2" title="t2"> k2 </a>

Here is my attempt:

string html =  "<a href=\"a1\" title=\"t1\"> k1 </a>";
       html += "<a href=\"a2\" title=\"t2\"> k2 </a>";

 //here is how I think my logic expression should work:
 //<a href=" [something that is not quote, 0 or more times] " [anything] </a>
Regex regex = new Regex("<a href=\"([^\"]*)\".*</a>");
foreach (Match match in regex.Matches(html))
    Console.WriteLine(match.Groups[1]);

Why does this print only a1? I am pretty sure I am doing it right. What am I missing?

dimitris93

1 Answer


The .* in your regular expression is greedy, so it consumes everything up to the last </a> in the input (here, the second one), and the whole string becomes a single match. What you need is lazy matching with .*? so that it consumes only up to the first </a> that follows:

Regex regex = new Regex("<a href=\"([^\"]*)\".*?</a>");
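
To see the difference side by side, here is a minimal sketch that runs both patterns over the string from the question. The class name HrefDemo and the helper ExtractHrefs are made up for this illustration; only Regex.Matches and Match.Groups come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefDemo
{
    // Hypothetical helper (not from the question): collects capture
    // group 1 from every match of `pattern` found in `input`.
    public static List<string> ExtractHrefs(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
            result.Add(match.Groups[1].Value);
        return result;
    }

    public static void Main()
    {
        string html = "<a href=\"a1\" title=\"t1\"> k1 </a>"
                    + "<a href=\"a2\" title=\"t2\"> k2 </a>";

        // Greedy: .* runs to the last </a>, so the whole string is one match.
        var greedy = ExtractHrefs(html, "<a href=\"([^\"]*)\".*</a>");

        // Lazy: .*? stops at the first </a>, so each anchor is its own match.
        var lazy = ExtractHrefs(html, "<a href=\"([^\"]*)\".*?</a>");

        Console.WriteLine(string.Join(", ", greedy)); // a1
        Console.WriteLine(string.Join(", ", lazy));   // a1, a2
    }
}
```

With the greedy pattern there is only one match, so the loop body runs once and prints a1. With the lazy pattern there are two matches, one per anchor.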

Meanwhile, see "Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms".

Community
William