Find Hyperlinks in Text using Python (twitter related)

Question

How can I parse text and find all instances of hyperlinks within a string? The hyperlinks will not be in the HTML format <a href="http://test.com">test</a>, but plain URLs such as http://test.com

Secondly, I would like to convert the original string by replacing every hyperlink with a clickable HTML hyperlink.

I found an example in this thread:

Easiest way to convert a URL to a hyperlink in a C# string?

but was unable to reproduce it in Python :(

You should use http://example.com for example URLs. See http://en.wikipedia.org/wiki/Example.com — John Fouhy, Apr 06 '09 at 03:29
Thanks John! I did not know that those are official example domains. — Dan Rosenstark, Dec 24 '09 at 13:40
See: https://stackoverflow.com/questions/9760588/how-do-you-extract-a-url-from-a-string-using-python/31952097#31952097 — Paolo Rovelli, Aug 11 '15 at 21:31

score 23 · Accepted Answer · edited May 23 '17 at 12:30

Here's a Python port of "Easiest way to convert a URL to a hyperlink in a C# string?":

import re

myString = "This is my tweet check it out http://tinyurl.com/blah"

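# Match "http://" followed by every character up to the next space.
r = re.compile(r"(http://[^ ]+)")
# Wrap each match in an anchor tag; \1 is the captured URL.
print(r.sub(r'<a href="\1">\1</a>', myString))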

Output:

This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
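
The same substitution can be widened a little. What follows is only a sketch, assuming that https and ftp URLs should be linked too, that the scheme should match case-insensitively, and that a URL ends at the first whitespace character (`\S+` rather than the `[^ ]+` used above). The names `URL_RE` and `linkify` are made up for illustration.

```python
import re

# Assumption: only http, https and ftp are of interest, and a URL runs
# until the next whitespace character. re.IGNORECASE lets "HTTP://" match.
URL_RE = re.compile(r"((?:https?|ftp)://\S+)", re.IGNORECASE)


def linkify(text):
    """Wrap every matched URL in text in an HTML anchor tag."""
    # \1 refers to the captured URL, exactly as in the original answer.
    return URL_RE.sub(r'<a href="\1">\1</a>', text)


print(linkify("Mirror at FTP://example.com/pub and docs at https://example.com/docs"))
```

Like the original pattern, this keeps any punctuation that directly follows a URL (for example a trailing full stop) inside the link, and it does not HTML-escape the surrounding text.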

answered Apr 06 '09 at 02:53

maxyfc

It can be improved by adding support for https or ftp URLs... Also, I believe the scheme (http) is case-INsensitive. – bortzmeyer Apr 06 '09 at 08:38
See http://stackoverflow.com/questions/1986059/gruber-s-url-regular-expression-in-python for hopefully a better regular expression. – tripleee Oct 10 '14 at 11:27

score 10 · Answer 2 · edited Aug 27 '21 at 18:20

Here is a much more sophisticated regexp from 2002.

@yoniLavi minified this to:

re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')
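
As a rough usage sketch, the pattern can be passed to `findall` to pull the URLs out of a string. The regular expression is copied verbatim from above; the names `URL_RE` and `find_urls` and the sample text are assumptions made for illustration.

```python
import re

# Pattern copied verbatim from the answer above.
URL_RE = re.compile(
    r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))'
)


def find_urls(text):
    """Return every URL in text that starts with one of the listed schemes."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return URL_RE.findall(text)


sample = "See http://example.com/path?x=1. Also ftp://example.org."
print(find_urls(sample))
# prints: ['http://example.com/path?x=1', 'ftp://example.org']
```

The lookahead at the end of the pattern is what keeps sentence punctuation such as the trailing full stops out of the matches.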

answered Jan 20 '10 at 15:45

dfrankow

I found it very useful too, and minified it to: `re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')` – yoniLavi Apr 29 '13 at 12:43
Great stuff, but what if the URL does not have the http:// prefix. Usually we don't specify that part any more in emails and social media. – dlink Jan 09 '16 at 18:43

score 5 · Answer 3 · edited Jul 28 '16 at 16:27

Django also has a solution that doesn't rely on a regex alone: django.utils.html.urlize(). I found it very helpful, especially if you are already using Django.

You can also extract the code to use in your own project.

edited Jul 28 '16 at 16:27

Erock

answered Jan 24 '12 at 06:16

Kekoa

score 2 · Answer 4 · answered Oct 25 '12 at 22:57

Jinja2 (which Flask uses) has a urlize filter that does the same.

jmoz

score 0 · Answer 5 · answered Jan 02 '23 at 14:04

I would also recommend having a look at urlextract.

You can install it by running: pip install urlextract

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']

The main advantage is that urlextract finds URLs even when they have no scheme (http, ftp, etc.). It also has many configuration options for tuning the extractor to your needs. Everything is covered in the documentation.
