Python - Regex matching urls in page source code

Question

I use this pattern to match every url in a given webpage:

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font></a>
"""

urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', source)

This has worked pretty well for me until now, but I found that it sometimes doesn't match the exact URL. In the example above it matches https://example.com</p> and https://example.com</font></a>, including the closing tags, and I can't figure out what is wrong with the regex. I took this code from another Stack Overflow question.

You use a hyphen inside a character class between two symbols, `[$-_]`, that creates a range that can match `<` and `>`, and all ASCII digits and uppercase letters, and more. Replace `[$-_@.&+]` with `[-$_@.&+]`. — Wiktor Stribiżew, Feb 09 '17 at 08:55
See this link: http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link — bob marti, Feb 09 '17 at 08:56
You can also check this: http://stackoverflow.com/questions/6883049/regex-to-find-urls-in-string-in-python — bob marti, Feb 09 '17 at 08:57
URLs should be in quotes. Do you have a special input or something? — Ika8, Feb 09 '17 at 08:59
@WiktorStribiżew That matches only the base URL: for https://example.com/1 it would match only https://example.com — Hyperion, Feb 09 '17 at 08:59
@bobmarti this code would not match urls in iframes (like in src="http://example.com") — Hyperion, Feb 09 '17 at 09:00
@Hyperion: Sure, that is why there are other comments. But the main idea is - forget about parsing HTML with regex. — Wiktor Stribiżew, Feb 09 '17 at 09:01
Maybe this is helpful to you: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — bob marti, Feb 09 '17 at 09:20
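
To see why the closing tags leak into the match, it may help to list what the `[$-_]` range in the question's pattern actually covers. The following is only an illustrative sketch: the names `range_chars`, `buggy`, `fixed` and `sample` are not from the original post, and adding `/` to the corrected class is an assumption made here so that paths such as `/1` still match (the comments do not spell that part out).

```python
import re

# Characters covered by the accidental range `$-_` (code points 0x24..0x5F).
range_chars = "".join(chr(c) for c in range(ord("$"), ord("_") + 1))

# The question's pattern: `$-_` inside the class is a range, not three literals.
buggy = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

# Hyphen moved to the front so it is a literal, as suggested in the comments.
# Adding `/` to the class is an assumption made here so that paths still match.
fixed = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[-$_@.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

sample = "<p>https://example.com/1</p>"

# The range silently includes `<`, `>` and `/`.
print("<" in range_chars, ">" in range_chars, "/" in range_chars)
print(buggy.findall(sample))  # closing tag is swallowed
print(fixed.findall(sample))  # stops at `<`
```

Other URL characters that the range happened to cover, such as `:`, `?` and `=`, would need to be added to the class in the same way if query strings or ports should match.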

score 1 · Accepted Answer · answered Feb 09 '17 at 09:24

Try this:

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font>
https://example.com</p></a>
https://example.com</font></a>
"""
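# Raw string, so the backslashes reach the regex engine unchanged.
# With capture groups, findall returns (scheme, host, path) tuples, not full URLs.
urls = re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', source)
print(urls)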
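
Because the pattern above contains capture groups, `re.findall` returns one tuple per match rather than the full URL string. One possible way to recover whole URLs is to iterate with `re.finditer` and take `group(0)`. This is a minimal sketch: the names `URL_RE` and `find_urls` are hypothetical, and the sample input (including the `iframe` line) is made up for illustration.

```python
import re

# Same pattern as in the answer, compiled once as a raw string.
URL_RE = re.compile(
    r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)


def find_urls(text):
    """Return the full text of each match (group 0), not the group tuples."""
    return [m.group(0) for m in URL_RE.finditer(text)]


# Hypothetical sample input: URLs in element text and in an attribute value.
source = """
<p>https://example.com</p>
<font color="E80000">https://example.com/1</font></a>
<iframe src="http://example.com/embed?id=3"></iframe>
"""

print(find_urls(source))
```

Since `<` and `"` are not in either character class, each match ends before a closing tag or a closing quote, so URLs inside attributes such as `src="..."` come out without trailing markup.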
