I use this pattern to match every URL in a given webpage:
import re
source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font></a>
"""
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', source)
This has worked pretty well for me until now, but I found that it sometimes doesn't match the exact URL. In the example above, it returns https://example.com</p>
and https://example.com</font></a>
as the URLs, including the closing tags, and I can't figure out what the problem in the regex is. I took this code from another Stack Overflow question.
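A minimal sketch of what seems to be going on, assuming the cause is the `$-_` inside the character class `[$-_@.&+]`: within brackets, `-` between two characters defines a range, so `$-_` covers every character from `$` (0x24) to `_` (0x5F), which includes `<`, `>` and `/`. The `adjusted` pattern below is a hypothetical variant, not the original author's; it escapes the hyphen and lists `/`, `:`, `?`, `=`, `#` and `~` explicitly, since the original only matched those through the accidental range.

```python
import re

source = """
<p>https://example.com</p>
<font color="E80000">https://example.com</font></a>
"""

# Pattern from the question, written as a raw string (same behavior).
original = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Characters covered by the range `$-_` (0x24 through 0x5F).
range_chars = {chr(c) for c in range(ord('$'), ord('_') + 1)}
# HTML punctuation that falls inside that range.
leaked = sorted(ch for ch in '<>/' if ch in range_chars)

# Hypothetical adjusted pattern: the hyphen is escaped so it is a literal,
# and common URL punctuation is listed explicitly (an assumption about
# which characters should count as part of a URL).
adjusted = r'https?://(?:[a-zA-Z0-9$\-_@.&+!*(),/:?=#~]|%[0-9a-fA-F]{2})+'

print(re.findall(original, source))
print(leaked)
print(re.findall(adjusted, source))
```

With the sample input, the adjusted pattern stops at the `<` of each closing tag. It is still only a heuristic: the set of allowed characters is a judgment call, and an HTML parser may be a better fit if the URLs always sit inside markup.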