Regex not stopping on first occurrence

Question

It may sound like a duplicate thread, but I swear it isn't. I already checked other posts and found nothing useful, so here I am.

I need to isolate each <a href=...> tag and extract the text between the opening and closing tags. I am able to do that with this regex:
(<a.*href=\"\%\(link[0-9]?[0-9]\)\".*?>(.*)?</a>)

The text I'm trying to match is:

Respectively <a href="%(link3)" target="_blank">Yosuke Matsuda</a> and <a href="%(link4)" target="_blank">Kenichi Sato</a>, as <a href="%(link5)" target="_blank">boss-fights</a>

But this is a borderline case. In fact, the regex matches the whole string, as if it doesn't stop at the first occurrence of </a>.

I'm trying this both in Python, with this code:

import re

text = '''Respectively <a href="%(link3)" target="_blank">Yosuke Matsuda</a> and <a href="%(link4)" target="_blank">Kenichi Sato</a>, as <a href="%(link5)" target="_blank">boss-fights</a>'''
link_matches = re.finditer(r'(<a.*href=\"\%\(link[0-9]?[0-9]\)\".*?>(.*)</a>)', text)

try:
    for index, match in enumerate(link_matches):
        text = text.replace(match.groups()[0],
                            f'[url path="{index}"]{match.groups()[1]}[/url]')
except:
    print("no match")

print(text)

... and on regex101.

I would expect to have this result:

Respectively [url path="0"]Yosuke Matsuda[/url] and [url path="1"]Kenichi Sato[/url], as [url path="2"]boss-fights[/url]

but what I get is:

Respectively [url path="0"]boss-fights[/url]

I have already tried moving the ? inside the </a> with no luck. Thank you in advance for the help!

Edit to re-open:

This post doesn't solve the problem. In fact, with that fix the match stops only once. The result is:

Respectively [url path="0"]Yosuke Matsuda</a> and <a href="%(link4)" target="_blank">Kenichi Sato</a>, as <a href="%(link5)" target="_blank">boss-fights[/url].</p>

[parsing/matching HTML with regex?](https://stackoverflow.com/a/1732454/5459839) — trincot, Sep 20 '22 at 15:03

score 0 · Accepted Answer · answered Sep 14 '22 at 15:55

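All I had to do was add some ?s to the regex. They make the greedy quantifiers lazy, so each .* stops at the earliest point where the rest of the pattern can match instead of running to the last </a>. Like so: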

(<a.*?href=\"\%\(link[0-9]?[0-9]\)\".*?>(.*?)</a>)
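
A minimal sketch of how the lazy version behaves on the sample text from the question. It swaps the `finditer`/`replace` loop for `re.sub` with a replacement function; that swap, the `itertools.count` counter, and the names `pattern`, `to_url`, and `result` are assumptions made for illustration, not part of the original code.

```python
import re
from itertools import count

text = ('Respectively <a href="%(link3)" target="_blank">Yosuke Matsuda</a> '
        'and <a href="%(link4)" target="_blank">Kenichi Sato</a>, '
        'as <a href="%(link5)" target="_blank">boss-fights</a>')

# Each lazy .*? stops at the first position where the rest of the pattern can match,
# so every anchor is matched on its own instead of one match spanning all three.
pattern = re.compile(r'<a.*?href="%\(link[0-9]?[0-9]\)".*?>(.*?)</a>')

counter = count()  # hypothetical running index for the path attribute


def to_url(match):
    # Replace one anchor with the [url] form, keeping only the link text.
    return f'[url path="{next(counter)}"]{match.group(1)}[/url]'


result = pattern.sub(to_url, text)
print(result)
```

Using `re.sub` also avoids calling `str.replace` on text that is being modified while the matches are iterated.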

Marco Frag Delle Monache


You may wish to consider changing `[0-9]?[0-9]` to `[0-9]+` if it's appropriate for your use-case. `[0-9]?[0-9]` will match `link0`, `link1`, ..., `link99` but it won't match `link100`, `link101`, .... `[0-9]+` will also match `link100` and beyond. – cba Sep 14 '22 at 16:13
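
A quick way to check the difference described in this comment, as a sketch. The sample strings and the names `two_digits`, `any_digits`, and `samples` are made up for illustration.

```python
import re

two_digits = re.compile(r'link[0-9]?[0-9]\)')  # one or two digits, as in the question
any_digits = re.compile(r'link[0-9]+\)')       # one or more digits

samples = ['link7)', 'link42)', 'link100)']  # hypothetical inputs

two_digit_hits = [s for s in samples if two_digits.fullmatch(s)]
any_digit_hits = [s for s in samples if any_digits.fullmatch(s)]

print(two_digit_hits)  # ['link7)', 'link42)']
print(any_digit_hits)  # ['link7)', 'link42)', 'link100)']
```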

You could also consider using a negated character class `[^<>]` matching any character except for the angle brackets: `<a[^<>]*\bhref="%\(link[0-9]{1,2}\)"[^<>]*>(.*?)</a>`. See https://regex101.com/r/49vDXS/1 – The fourth bird Sep 14 '22 at 20:34
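
A sketch of the negated character class idea in Python. The pattern below is an assumption written for illustration and may differ from the one behind the regex101 link; `mixed`, `lazy`, and `strict` are hypothetical names, and the `mixed` string is a made-up input with one anchor that should not match.

```python
import re

text = ('Respectively <a href="%(link3)" target="_blank">Yosuke Matsuda</a> '
        'and <a href="%(link4)" target="_blank">Kenichi Sato</a>, '
        'as <a href="%(link5)" target="_blank">boss-fights</a>')

# [^<>]* cannot cross an angle bracket, so the attribute part stays inside one tag.
strict = re.compile(r'<a\b[^<>]*\bhref="%\(link[0-9]{1,2}\)"[^<>]*>(.*?)</a>')
# The lazy-dot pattern from the accepted answer, for comparison.
lazy = re.compile(r'<a.*?href="%\(link[0-9]?[0-9]\)".*?>(.*?)</a>')

labels = strict.findall(text)
print(labels)  # ['Yosuke Matsuda', 'Kenichi Sato', 'boss-fights']

# Hypothetical input where the first anchor is not a %(linkN) anchor.
mixed = '<a href="other">skip</a> <a href="%(link1)">keep</a>'
lazy_match = lazy.search(mixed).group(0)
strict_match = strict.search(mixed).group(0)
print(lazy_match)    # starts at the first <a and runs through the second anchor
print(strict_match)  # only the %(link1) anchor
```

The lazy dot can still run across tag boundaries when an earlier `<a` does not contain the expected `href`, which is the case the negated class guards against.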
Thank you all for the help, but I still can't understand why my question got closed. I even recreated the question to answer it myself, and explained why other posts don't fix it. If you are able to, please reopen it. I will accept it as the solution as soon as possible. – Marco Frag Delle Monache Sep 15 '22 at 13:29
About your suggestions: including [0-9]?[0-9] is just an overload. it's about matching some – Marco Frag Delle Monache Sep 15 '22 at 13:31
