
I know that with urllib you can parse a string and check if it's a valid URL. But how would one go about checking whether a sentence contains a URL, and then extracting that URL? I've seen some huge regular expressions out there, but I would rather not use something that I really can't comprehend.

So basically I have an input string, and I need to find and extract all the URLs within that string.

What's a clean way of going about this?

Cooper
  • If your input source is HTML or XML, don't do it this way; use a proper parser instead. – Daenyth Mar 19 '11 at 19:29
  • Could you post a typical example input? – Mark Byers Mar 19 '11 at 19:58
  • URL matching is quite a huge topic, with a lot of rules... that is why all the regexes you find are big and hard to comprehend. Try checking this regex (which is split to match the various URL parts): https://stackoverflow.com/questions/9760588/how-do-you-extract-a-url-from-a-string-using-python/31952097#31952097 – Paolo Rovelli Aug 11 '15 at 21:25

2 Answers


You can search for "words" containing : and then pass them to urlparse (renamed to urllib.parse in Python 3.0 and newer) to check if they are valid URLs.

Example:

possible_urls = re.findall(r'\S+:\S+', text)
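
As a minimal sketch of how the two pieces fit together (the helper name extract_urls and the scheme/netloc check are my own choices, not something prescribed by the answer):

    import re
    from urllib.parse import urlparse

    def extract_urls(text):
        """Return whitespace-delimited tokens that urlparse accepts as URLs."""
        candidates = re.findall(r'\S+:\S+', text)
        urls = []
        for candidate in candidates:
            parsed = urlparse(candidate)
            # Keep only candidates with both a scheme (http, ftp, ...) and a
            # network location; bare "word:word" tokens are dropped.
            if parsed.scheme and parsed.netloc:
                urls.append(candidate)
        return urls

    print(extract_urls("See http://example.com/ and ftp://ftp.example.org/file"))
    # ['http://example.com/', 'ftp://ftp.example.org/file']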

If you want to restrict yourself only to URLs starting with http:// or https:// (or anything else you want to allow) you can also do that with regular expressions, for example:

possible_urls = re.findall(r'https?://\S+', text)

You may also want to use some heuristics to determine where the URL starts and stops, because people sometimes add punctuation to a URL, producing something that is still syntactically valid but not the URL they intended, for example:

Have you seen the new look for http://example.com/? It's a total ripoff of http://example.org/!

Here the punctuation after the URL is not intended to be part of the URL. You can see from the automatically added links in the above text that StackOverflow implements such heuristics.
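
One simple (and admittedly crude) heuristic, not part of the answer itself, is to strip common sentence punctuation from the end of each match:

    import re

    def strip_trailing_punctuation(url):
        # Characters like . , ! ? ; : ) ] often trail a URL at the end of a
        # sentence or inside parentheses but are rarely meant to be part of it.
        return url.rstrip('.,!?;:)]')

    text = "Have you seen http://example.com/? It's a ripoff of http://example.org/!"
    print([strip_trailing_punctuation(u) for u in re.findall(r'https?://\S+', text)])
    # ['http://example.com/', 'http://example.org/']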

Mark Byers
  • `://` is common, but a URL may not contain `://` at all. [RFC here](http://tools.ietf.org/html/rfc1738#section-5). – khachik Mar 19 '11 at 19:31

Plucking a URL out of "the wild" is a tricky endeavor (to do correctly). Jeff Atwood wrote a blog post on this subject: The Problem With URLs. John Gruber has also addressed the issue: An Improved Liberal, Accurate Regex Pattern for Matching URLs. I have written some code that attempts to tackle the problem as well: URL Linkification (HTTP/FTP) (for PHP/JavaScript). (Note that my regex is particularly complex because it is designed to be applied to HTML markup and attempts to skip URLs that are already linkified, i.e. <a href="http://example.com">Link!</a>.)

Second, when it comes to validating a URI/URL, the document you want to look at is RFC-3986. I've been working on an article dealing with this very subject: Regular Expression URI Validation. You may want to take a look at that as well.
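
For a rough illustration of the component-based view that RFC-3986 takes (and nothing like a full validator, or the regex approach described above), Python's urllib.parse can split a candidate into scheme, authority, path, query and fragment; the function name and the http/https restriction below are my own assumptions:

    from urllib.parse import urlparse

    def looks_like_http_url(candidate):
        """Very rough check: an http(s) scheme plus a non-empty host."""
        parsed = urlparse(candidate)
        return parsed.scheme in ('http', 'https') and bool(parsed.netloc)

    print(looks_like_http_url('http://example.com/path?q=1'))  # True
    print(looks_like_http_url('not a url'))                    # False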

But when you get down to it, this is not a trivial task!

ridgerunner