14

How can I parse text and find all instances of hyperlinks with a string? The hyperlink will not be in the html format of <a href="http://test.com">test</a> but just http://test.com

Secondly, I would like to then convert the original string and replace all instances of hyperlinks into clickable html hyperlinks.

I found an example in this thread:

Easiest way to convert a URL to a hyperlink in a C# string?

but was unable to reproduce it in python :(

Community
  • 1
  • 1
TimLeung
  • 3,459
  • 6
  • 42
  • 59

5 Answers5

23

Here's a Python port of Easiest way to convert a URL to a hyperlink in a C# string?:

import re

myString = "This is my tweet check it out http://tinyurl.com/blah"

r = re.compile(r"(http://[^ ]+)")
print r.sub(r'<a href="\1">\1</a>', myString)

Output:

This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
Community
  • 1
  • 1
maxyfc
  • 11,167
  • 7
  • 37
  • 46
  • 3
    It can be improved by adding support for https or ftp URLs... Also, I believe the scheme (http) is case-INsensitive. – bortzmeyer Apr 06 '09 at 08:38
  • See http://stackoverflow.com/questions/1986059/gruber-s-url-regular-expression-in-python for hopefully a better regular expression. – tripleee Oct 10 '14 at 11:27
10

Here is a much more sophisticated regexp from 2002.

@yoniLavi minified this to:

re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')
dfrankow
  • 20,191
  • 41
  • 152
  • 214
  • 2
    I found it very useful too, and minified it to: `re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')` – yoniLavi Apr 29 '13 at 12:43
  • 3
    Great stuff, but what if the URL does not have the http:// prefix. Usually we don't specify that part any more in emails and social media. – dlink Jan 09 '16 at 18:43
5

Django also has a solution that doesn't just use regex. It is django.utils.html.urlize(). I found this to be very helpful, especially if you happen to be using django.

You can also extract the code to use in your own project.

Erock
  • 770
  • 7
  • 10
Kekoa
  • 27,892
  • 14
  • 72
  • 91
2

Jinja2 (Flask uses this) has a filter urlize which does the same.

Docs

jmoz
  • 7,846
  • 5
  • 31
  • 33
0

I would recommend to have a look also on urlextract

You can install it running: pip install urlextract

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']

The main advantage is that urlextract will find URLs without specifying schema (http, ftp, etc.) it has also a lot of configuration options to tune in the extractor to fit your needs. Everything can be found in documentation.

Jan Lipovský
  • 341
  • 2
  • 5