It may sound like a duplicate thread, but I swear it isn't. I already checked other posts and found nothing useful, so here I am.
What I need to do is isolate the <a href> tags and take the value in between. I am able to do that with this regex:
(<a.*href=\"\%\(link[0-9]?[0-9]\)\".*?>(.*)?</a>)
The text I'm trying to match is:
Respectively <a href="%(link3)" target="_blank">Yosuke Matsuda</a> and <a href="%(link4)" target="_blank">Kenichi Sato</a>, as <a href="%(link5)" target="_blank">boss-fights</a>
But this is a borderline case. In fact, it matches the whole string, as if it doesn't stop at the first occurrence of </a>.
I'm trying this both in Python, with this code:
import re
text = '''Respectively <a href="%(link3)" target="_blank">Yosuke Matsuda</a> and <a href="%(link4)" target="_blank">Kenichi Sato</a>, as <a href="%(link5)" target="_blank">boss-fights</a>'''
link_matches = re.finditer(r'(<a.*href=\"\%\(link[0-9]?[0-9]\)\".*?>(.*)</a>)', text)
try:
    for index, match in enumerate(link_matches):
        text = text.replace(match.groups()[0],
                            f'[url path="{index}"]{match.groups()[1]}[/url]')
except:
    print("no match")
print(text)
... and on regex101.
I would expect to have this result:
Respectively [url path="0"]Yosuke Matsuda[/url] and [url path="1"]Kenichi Sato[/url], as [url path="2"]boss-fights[/url]
but what I get is:
Respectively [url path="0"]boss-fights[/url]
I have already tried moving the ? inside the </a> with no luck.
Thank you in advance for the help!
Edit to re-open:
This post doesn't solve the problem. In fact, it stops only once. The result is:
Respectively [url path="0"]Yosuke Matsuda</a> and <a href="%(link4)" target="_blank">Kenichi Sato</a>, as <a href="%(link5)" target="_blank">boss-fights[/url].</p>
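For reference, here is a minimal sketch of one direction that seems to produce the expected result on the sample text above. It assumes the cause is the greedy wildcards: the first `.*` (before `href`) and the `(.*)` group can both run across intermediate `</a>` tags. The sketch swaps them for `[^>]*`, which cannot leave the opening tag, and a lazy `(.*?)`. The names `LINK_RE` and `to_bbcode` are made up for illustration, and the per-call counter is only one way to get the 0, 1, 2 numbering.

```python
import re
from itertools import count

text = ('Respectively <a href="%(link3)" target="_blank">Yosuke Matsuda</a> and '
        '<a href="%(link4)" target="_blank">Kenichi Sato</a>, as '
        '<a href="%(link5)" target="_blank">boss-fights</a>')

# [^>]* cannot cross the end of the opening tag, and (.*?) is lazy,
# so each match ends at the first </a> after its own opening tag.
LINK_RE = re.compile(r'<a\b[^>]*href="%\(link[0-9]{1,2}\)"[^>]*>(.*?)</a>')


def to_bbcode(source):
    """Replace each matching <a> element with a numbered [url] tag."""
    counter = count()  # hypothetical numbering: restarts at 0 on every call
    return LINK_RE.sub(
        lambda m: f'[url path="{next(counter)}"]{m.group(1)}[/url]',
        source,
    )


result = to_bbcode(text)
print(result)
```

Using `re.sub` with a callback also avoids the `text.replace` loop, which would replace every identical occurrence of a matched string at once. This is still a regex over HTML, so nested or multi-line anchors are not handled (`.` does not match newlines without `re.DOTALL`).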