26

Although I know I could use some huge-ass regex such as the one posted here, I'm wondering if there is some tweaky-as-hell way to do this, either with a standard module or perhaps some third-party add-on?

Simple question, but nothing jumped out on Google (or Stackoverflow).

Look forward to seeing how y'all do this!

jkp
  • See http://stackoverflow.com/questions/285938/decomposing-html-to-link-text-and-target, http://stackoverflow.com/questions/430966/regex-for-links-in-html-text, http://stackoverflow.com/questions/442660/link-extraction-in-python-closed – S.Lott Feb 06 '09 at 12:36
  • Practically, in this situation people often do not match strictly against URL patterns. Trailing punctuation is typically considered not part of the URL. This kind of choice is application-specific, hence the lack of a standard library module. – ddaa Feb 06 '09 at 13:06
  • @S.Lott - he's asking about parsing urls from strings, not anchors from html – Yarin May 31 '12 at 14:18
  • For simple usage use re.findall(r'www.[^\s]+?(?:in|com|extensions)',string) – Wickkiey Aug 26 '15 at 16:35

9 Answers

21

I know it's exactly what you do not want, but here's a file with a huge regex:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
the web url matching regex used by markdown
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611
"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

I call that file urlmarker.py, and when I need it I just import it, e.g.

import urlmarker
import re
re.findall(urlmarker.URL_REGEX,'some text news.yahoo.com more text')

cf. http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Also, here is what Django (1.6) uses to validate URLFields:

regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

cf. https://github.com/django/django/blob/1.6/django/core/validators.py#L43-50
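
Since that Django pattern is anchored with ^ and $, it validates whole strings rather than extracting URLs from running text. Here is a minimal sketch of using it that way (the looks_like_url helper is my own name, not Django's):

import re

django_url_regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def looks_like_url(candidate):
    # match() against the anchored pattern answers "is this whole string a URL?"
    return django_url_regex.match(candidate) is not None

print(looks_like_url('http://news.yahoo.com/topstories'))  # True
print(looks_like_url('not a url'))                         # False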

Django 1.9 has that logic split across a few classes.

dranxo
14

Look at Django's approach here: django.utils.html.urlize(). Regexes are too limited for the job, and you have to use heuristics to get results that are mostly right.
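
A rough sketch of what that looks like in practice, assuming Django is installed (urlize lives in django.utils.html):

from django.utils.html import urlize

text = "Check www.djangoproject.com or http://example.com/foo for details."
print(urlize(text))
# emits the same text with <a href="...">...</a> wrapped around everything
# the heuristics recognise as a link, including the bare www. hostname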

Björn Lindqvist
14

There is an excellent comparison of 13 different regex approaches, which can be found at this page: In search of the perfect URL validation regex.

The Diego Perini regex, which passed all the tests, is very long but is available at his gist here.
Note that you will have to convert his PHP version to a Python regex (there are slight differences).

I ended up using the Imme Emosol version, which passes the vast majority of tests and is a fraction of the size of Diego Perini's.

Here is a Python-compatible version of the Imme Emosol regex:

r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'
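
Like the Django pattern, this one is anchored, so it validates rather than extracts. A quick sketch of using it (the line splits and the IMME_EMOSOL name are mine; the pattern itself is unchanged):

import re

IMME_EMOSOL = re.compile(
    r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?'
    r'(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])'
    r'(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}'
    r'(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))'
    r'|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)'
    r'(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*'
    r'(?:\.(?:[a-z\u00a1-\uffff]{2,})))'
    r'(?::\d{2,5})?(?:/[^\s]*)?$')

for candidate in ('http://example.com/path', 'example.com', 'http://10.0.0.1:8080/'):
    # anchored pattern, so match() answers "is this whole string a URL?"
    print(candidate, bool(IMME_EMOSOL.match(candidate)))
# http://example.com/path True
# example.com False   (the pattern requires an explicit scheme)
# http://10.0.0.1:8080/ True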
tohster
  • This fails to recover any valid url when I input the string `Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, https://www.webdesignerdepot.com/2012/10/creating-a-modal-window-with-html5-and-css3/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid). Also, in this case I only want to allow the HTTP, HTTPS and FTP protocols.` There's at least 1 valid url in that list. – Hassan Baig Apr 25 '18 at 12:06
7

You can use this library I wrote:

https://github.com/imranghory/urlextractor

It's extremely hacky, but it doesn't rely upon "http://" like many other techniques; rather, it uses the Mozilla TLD list (via the tldextract library) to search for TLDs (e.g. ".co.uk", ".com", etc.) in the text and then attempts to construct urls around the TLD.

It doesn't aim to be RFC-compliant but rather accurate for how urls are used in practice in the real world. For example, it will reject the technically valid domain "com" (you can actually use a TLD as a domain, although it's rare in practice) and will strip trailing full stops or commas from urls.
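
This is not urlextractor's API (see its README for that), but a minimal sketch of the underlying idea using tldextract directly; the has_real_tld helper is my own name:

import tldextract

def has_real_tld(candidate):
    # tldextract consults the public-suffix list; a non-empty suffix
    # means the candidate ends in a known TLD such as .com or .co.uk
    return bool(tldextract.extract(candidate).suffix)

for word in ('example.co.uk.', 'localhost'):
    word = word.strip('.,')  # drop trailing full stops/commas, as the answer describes
    if has_real_tld(word):
        print(word)  # -> example.co.uk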

Imran
6

If you know that there is a URL following a space in the string, you can do something like this, where s is the string containing the url:

>>> t = s[s.find("http://"):]
>>> t = t[:t.find(" ")]

Otherwise you need to check whether find returns -1 or not, as in the sketch below.
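
For example, a small sketch with the -1 case handled (extract_after_http is my own name):

def extract_after_http(s):
    start = s.find("http://")
    if start == -1:
        return None  # no URL in the string
    end = s.find(" ", start)
    # if no space follows, the URL runs to the end of the string
    return s[start:] if end == -1 else s[start:end]

print(extract_after_http("see http://example.com for details"))  # http://example.com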

sinzi
4

You can use BeautifulSoup.

from bs4 import BeautifulSoup

def extractlinks(html):
    soup = BeautifulSoup(html, 'html.parser')
    # only keep anchors that actually carry an href attribute
    anchors = soup.find_all('a', href=True)
    links = []
    for a in anchors:
        links.append(a['href'])
    return links
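
For example, given a snippet with one linked anchor and one bare named anchor:

html = '<p><a href="http://example.com">x</a> <a name="note">y</a></p>'
print(extractlinks(html))  # ['http://example.com']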

Note that the solution with regexes is faster, although it will not be as accurate.

Seb
  • Sebastian: I know BeautifulSoup but the problem is that it will only extract anchored URLs. I'm trying to search plain text for anything URL-like. Thanks for the suggestion though. – jkp Feb 06 '09 at 12:18
3

I'm late to the party, but here is a solution someone from #python on freenode suggested to me. It avoids the regex hassle.

from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def extract_urls(text):
    """Return a list of urls from a text string."""
    out = []
    for word in text.split(' '):
        thing = urlparse(word.strip())
        # anything with an explicit scheme (http, ftp, ...) counts as a URL
        if thing.scheme:
            out.append(word)
    return out
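
For example:

print(extract_urls('read http://example.com and ftp://host/file today'))
# ['http://example.com', 'ftp://host/file']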
Shatnerz
  • This relies on the existence of a url scheme. That would be ideal in some well-structured situations, but entirely useless in others. Example of the latter: user-generated strings. – Hassan Baig Apr 30 '18 at 06:36
1

There is another way to extract URLs from text easily. You can use urlextract to do it for you; just install it via pip:

pip install urlextract

and then you can use it like this:

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']

You can find more info on my github page: https://github.com/lipoja/URLExtract

NOTE: It downloads a list of TLDs from iana.org to keep itself up to date. But if the program does not have internet access, then it's not for you.

This approach is similar to urlextractor (mentioned above), but my code is more recent, maintained, and I am open to any suggestions (new features).

Jan Lipovský
  • Not to be a nitpick, but `stackoverflow.com` is *not* a URL, it's a hostname. Frankly, I don't like applications that try to look smarter by turning everything they possibly can into a link. With TLDs these days it's even easier to make a mistake. – Alois Mahdal Sep 29 '17 at 16:27
  • I agree. In the example it is not a URL, just a hostname. I wrote this lib because it is how I figured out how to extract URLs (hostnames) from plain text. The TLD is the only part of a URL or hostname that is easily recognizable in plain text; it's easy to match. And when the surrounding text meets certain criteria, you can say that you found a URL. – Jan Lipovský Oct 13 '17 at 12:21
  • *In early days, we used to talk about command.com, but today, I'm much more likely to mention some example.sh or some file I named example.name (because it contains name). Sometimes I forget space after period.net gain is roughly the same, though.* I know, this is not common except technical jargon, but still, there are no URLs so software that sees them is wrong and I don't want *my* comm to be broken. (It's already enough what some readers do with smileys---dare you end parenthesis with B). – Alois Mahdal Oct 13 '17 at 17:18
  • _"... but still, there are no URLs so software that sees them is wrong"_ - I see it differently. A hostname is part of a URL. Let's say that this software can find hostnames by default. In your example above I can see four hostnames: command.com, example.sh, example.name, period.net. And I think that those four are valid hostnames. Or am I wrong? On the other hand, this software was written as a python class with methods to customize the search for URLs. So you can set it to fit your needs. – Jan Lipovský Nov 25 '17 at 22:34
  • What I'm saying is that whether it *is* or *is not* a hostname depends on the context. Compare: `Jan Lipovský`, and `Mon Jan 24`. Humans are excellent at detecting patterns in context; machines are poor at that. By matching just the hostname you are implementing a "naive" rule, which will inevitably (just make a typo or use strange enough jargon) be wrong, causing a clash with human intuition, i.e. your SW looks silly and you lose UX points. – Alois Mahdal Nov 28 '17 at 18:30
0
import re

text = '<p>Please click <a href="http://www.dr-chuck.com">here</a></p>'
# non-greedy match, so multiple href attributes on one line don't merge into one
aa = re.findall('href="(.+?)"', text)
print(aa)  # ['http://www.dr-chuck.com']
zlin