I use this pattern to match every URL in a given webpage:

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font></a>
"""

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', source)

This has worked pretty well for me until now, but I found that it sometimes doesn't match the exact URL. In the example above, it matches https://example.com</p> and https://example.com</font></a>, including the closing tags, and I can't figure out what the problem in the regex is. I took this code from another Stack Overflow question.

Hyperion
  • You use a hyphen inside a character class between two symbols, `[$-_]`, that creates a range that can match `<` and `>`, and all ASCII digits and uppercase letters, and more. Replace `[$-_@.&+]` with `[-$_@.&+]`. – Wiktor Stribiżew Feb 09 '17 at 08:55
  • See this link: http://stackoverflow.com/questions/499345/regular-expression-to-extract-url-from-an-html-link – bob marti Feb 09 '17 at 08:56
  • You can also check this: http://stackoverflow.com/questions/6883049/regex-to-find-urls-in-string-in-python – bob marti Feb 09 '17 at 08:57
  • URLs should be in quotes. Do you have a special input or something? – Ika8 Feb 09 '17 at 08:59
  • @WiktorStribiżew This matches only the base url, like https://example.com/1 would match only https://example.com – Hyperion Feb 09 '17 at 08:59
  • @bobmarti this code would not match urls in iframes (like in src="http://example.com") – Hyperion Feb 09 '17 at 09:00
  • @Hyperion: Sure, that is why there are other comments. But the main idea is - forget about parsing HTML with regex. – Wiktor Stribiżew Feb 09 '17 at 09:01
  • Maybe this is helpful to you: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – bob marti Feb 09 '17 at 09:20
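
To illustrate the first comment: inside a character class, `$-_` is a range from `$` (0x24) to `_` (0x5F), which covers `<` and `>` and also `/`. That last point would explain why simply making the hyphen literal cuts https://example.com/1 down to the base URL. A minimal sketch of the three behaviours follows; the `with_path_chars` pattern and its extra characters `/:?=#~` are an assumption for illustration, not something proposed in the thread, and the sample input is made up.

```python
import re

# Made-up sample input for illustration
source = """
<p>https://example.com</p>
<p>https://example.com/1</p>
"""

# Pattern from the question: `$-_` is an accidental range
original = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Hyphen moved to the front of the class so it is a literal hyphen
literal_hyphen = r'http[s]?://(?:[a-zA-Z]|[0-9]|[-$_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# Assumption: path/query characters that the range used to cover are listed explicitly
with_path_chars = r'http[s]?://(?:[a-zA-Z]|[0-9]|[-$_@.&+/:?=#~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Which of these characters fall inside the range '$'..'_'?
range_hits = [c for c in '<>/' if re.fullmatch(r'[$-_]', c)]

original_urls = re.findall(original, source)
literal_urls = re.findall(literal_hyphen, source)
path_urls = re.findall(with_path_chars, source)

print(range_hits)     # ['<', '>', '/']
print(original_urls)  # ['https://example.com</p>', 'https://example.com/1</p>']
print(literal_urls)   # ['https://example.com', 'https://example.com']
print(path_urls)      # ['https://example.com', 'https://example.com/1']
```

The explicit list is deliberately conservative; which characters to allow depends on the URLs you expect to see.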

1 Answer

Try this:

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font>
https://example.com</p></a>
https://example.com</font></a>
"""
# Raw string so the backslash escapes reach the regex engine unchanged
urls = re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', source)
print(urls)  # Python 3 print function
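
One thing to be aware of: because this pattern contains capturing groups, `re.findall` returns a tuple of groups (scheme, host, path) for each match rather than the whole URL. A minimal sketch of one way to get the full URLs back, using `re.finditer` and `group(0)`; the `sample` string is made up for illustration.

```python
import re

# Made-up sample input, including a URL inside an attribute
sample = '<p>https://example.com/a?b=1</p> <iframe src="http://example.org"></iframe>'

pattern = r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

# With capturing groups, findall yields one tuple of groups per match
groups = re.findall(pattern, sample)

# group(0) is the entire matched text, regardless of the groups
full_urls = [m.group(0) for m in re.finditer(pattern, sample)]

print(groups)     # [('https', 'example.com', '/a?b=1'), ('http', 'example.org', '')]
print(full_urls)  # ['https://example.com/a?b=1', 'http://example.org']
```

Turning the groups into non-capturing ones, `(?:...)`, would make `findall` return whole URLs directly.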
Arun