The test string has four spans with class attributes, all on one line, which is why I can't use '.' to match any character. This means I have to use a character list (character class). Here is a working example where everything is fine: spans 1, 2, and 4 are matched correctly. Link: https://regex101.com/r/QpHJNw/1

However, let's assume the first character list has to be extended to [\w\d\s"=<>:./] because, for some reason, the test string will contain such data, regardless of whether it is valid HTML:

New regexp:

<span class=\"select\"[\w\d\s\"=<>:./]*><a href=[\w\d\s\"<>:/.-]+</a></span>

New test string:

<span class="select" foo="1<>/"><a href="https://www.domain.tld">Word 1</a></span><span class="select"><a href="https://www.domain.tld">Word 2</a></span><span class="no_select"><a href="https://www.domain.tld">Word 3</a></span><span class="select" bar="2"><a href="https://www.domain.tld">Word 4</a></span>

Link: https://regex101.com/r/6S37B3/1

Naturally, this matches the entire string, as the character list contains all the characters used. Is there a way to give the ending </a></span> of the regex a higher priority, meaning that regardless of what is in the character list, the match always stops at its first occurrence? In the end, it should produce the same matches as in the first example: spans 1, 2, and 4.

Thanks!

Jim B
  • If the data you expect will contain invalid HTML, how can you be sure that the invalid text it contains doesn't also contain `` where you don't want to match it? – Grismar Jan 24 '22 at 00:00
  • As for a solution to your problem, adding a `?` right after the `*` seems like a straightforward one? (i.e. you don't want `*` to be greedy here) – Grismar Jan 24 '22 at 00:01

1 Answer

You should use non-greedy (lazy) quantifiers by adding ? after * and +:

<span class=\"select\"[\w\d\s\"=<>:./]*?><a href=[\w\d\s\"<>:/.-]+?</a></span>
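
To see the difference in practice, here is a minimal Python sketch that runs the greedy and the non-greedy pattern against the test string from the question. Using Python's re module is an assumption (the question does not name a regex flavor), and the names TEST_STRING, GREEDY, and LAZY are made up for illustration.

```python
import re

# Test string from the question: four spans on a single line.
TEST_STRING = (
    '<span class="select" foo="1<>/"><a href="https://www.domain.tld">Word 1</a></span>'
    '<span class="select"><a href="https://www.domain.tld">Word 2</a></span>'
    '<span class="no_select"><a href="https://www.domain.tld">Word 3</a></span>'
    '<span class="select" bar="2"><a href="https://www.domain.tld">Word 4</a></span>'
)

# Greedy quantifiers (* and +): each character list grabs as much as it can.
GREEDY = r'<span class=\"select\"[\w\d\s\"=<>:./]*><a href=[\w\d\s\"<>:/.-]+</a></span>'

# Lazy quantifiers (*? and +?): each character list grabs as little as it can.
LAZY = r'<span class=\"select\"[\w\d\s\"=<>:./]*?><a href=[\w\d\s\"<>:/.-]+?</a></span>'

greedy_matches = re.findall(GREEDY, TEST_STRING)
lazy_matches = re.findall(LAZY, TEST_STRING)

print("greedy:", len(greedy_matches), "match(es)")
print("lazy:  ", len(lazy_matches), "match(es)")
for match in lazy_matches:
    print(match)
```

Note that a lazy quantifier only changes which alternative the engine tries first. It can still run past a closing </a></span> if that is the only way to complete a match, so this relies on the input having the shape shown in the question.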

P.S. Don't parse HTML with regexp; consider a parser such as bs4, html.parser, or lxml.
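
As a rough illustration of the parser route, the sketch below uses html.parser from the Python standard library to collect the links inside spans whose class is exactly "select". The class name SelectLinkParser and its attributes are hypothetical names chosen for this example, and it assumes the markup is at least well-formed enough for html.parser to recognize the tags.

```python
from html.parser import HTMLParser


class SelectLinkParser(HTMLParser):
    """Collects (href, text) for <a> tags inside <span class="select">."""

    def __init__(self):
        super().__init__()
        self.span_stack = []      # one bool per open <span>: is it class="select"?
        self.current_href = None  # href of the <a> currently being read, if any
        self.current_text = []    # text fragments seen inside that <a>
        self.links = []           # collected (href, text) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            self.span_stack.append(attrs.get("class") == "select")
        elif tag == "a" and self.span_stack and self.span_stack[-1]:
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        # Only keep text while inside a link we decided to collect.
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            self.links.append((self.current_href, "".join(self.current_text)))
            self.current_href = None
        elif tag == "span" and self.span_stack:
            self.span_stack.pop()


TEST_STRING = (
    '<span class="select" foo="1<>/"><a href="https://www.domain.tld">Word 1</a></span>'
    '<span class="select"><a href="https://www.domain.tld">Word 2</a></span>'
    '<span class="no_select"><a href="https://www.domain.tld">Word 3</a></span>'
    '<span class="select" bar="2"><a href="https://www.domain.tld">Word 4</a></span>'
)

parser = SelectLinkParser()
parser.feed(TEST_STRING)
parser.close()

for href, text in parser.links:
    print(href, text)
```

With this approach the odd attribute value foo="1<>/" is handled by the parser's own quoting rules rather than by a hand-written character list.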

Alex Kosh