
I have this regex:

<li><i>(?:<a.*?>)?(.*)(?:<.*?>)?</i></li>

Now, this should either match this text:

<li><i><a href="hello.htm">Hi there</a></i></li>

or without the <a> tag, like so:

<li><i>42nd Street</i></li>

Without the <a> tag, the regex works just fine. The problem is that with the first example, I get this match:

Hi there</a>

I've read about non-capturing groups with (?:regex), but I do not know why the match insists on including the closing </a> tag. What regex would ignore the closing </a> tag so that I only get Hi there?

mishmash

1 Answer


The (.*) that you are capturing is greedy, and the (?:<.*?>)? after it is optional, so (.*) consumes as much as it can, including the </a>, and the optional group then matches nothing. To fix this, change the .* to .*? so it is lazy (matches as few characters as possible):

<li><i>(?:<a.*?>)?(.*?)(?:<.*?>)?</i></li>
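A quick way to check the difference is to run both patterns against the two inputs from the question. This is a minimal sketch in Python (the question does not say which regex engine is in use, so Python's re module is an assumption, and the names GREEDY, LAZY, and capture are only for illustration):

```python
import re

# Pattern from the question: the capture group is greedy.
GREEDY = re.compile(r'<li><i>(?:<a.*?>)?(.*)(?:<.*?>)?</i></li>')
# Pattern from the answer: the capture group is lazy.
LAZY = re.compile(r'<li><i>(?:<a.*?>)?(.*?)(?:<.*?>)?</i></li>')

WITH_LINK = '<li><i><a href="hello.htm">Hi there</a></i></li>'
WITHOUT_LINK = '<li><i>42nd Street</i></li>'


def capture(pattern, text):
    """Return the first capture group, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(capture(GREEDY, WITH_LINK))    # Hi there</a>
print(capture(LAZY, WITH_LINK))      # Hi there
print(capture(LAZY, WITHOUT_LINK))   # 42nd Street
```

With the lazy group, the engine grows the capture one character at a time and stops as soon as the rest of the pattern can match, which is the point where the optional (?:<.*?>)? can consume the </a>.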

But don't parse HTML with regex.
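One way to follow that advice, assuming Python is available, is the standard library's html.parser module. The sketch below is not from the original answer; the class name ItalicListItemText and the helper extract_items are hypothetical, and it only handles the simple <li><i>...</i></li> shape shown in the question:

```python
from html.parser import HTMLParser


class ItalicListItemText(HTMLParser):
    """Collect the text inside each <li><i>...</i></li>, with or without an <a> wrapper."""

    def __init__(self):
        super().__init__()
        self._stack = []  # names of the currently open tags
        self.items = []   # one string per <li><i> element found

    def _inside_li_i(self):
        # True when an <i> sits directly inside an <li> somewhere in the open-tag stack.
        return any(self._stack[k:k + 2] == ['li', 'i']
                   for k in range(len(self._stack) - 1))

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if self._stack[-2:] == ['li', 'i']:
            self.items.append('')  # start collecting text for a new item

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.items and self._inside_li_i():
            self.items[-1] += data


def extract_items(html):
    parser = ItalicListItemText()
    parser.feed(html)
    parser.close()
    return parser.items


print(extract_items('<li><i><a href="hello.htm">Hi there</a></i></li>'
                    '<li><i>42nd Street</i></li>'))
# ['Hi there', '42nd Street']
```

Because the parser tracks open tags rather than matching characters, the optional <a> wrapper needs no special case.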

Andrew Clark