replace URLs in text with links to URLs

Question

Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does. Can this be done in a one liner regular expression?

Edit: by body of text I just meant plain text - no HTML

One would assume though, that you are *creating* HTML, since plain text has no special notation of a link vs. a URL. So you would convert `http://blah.com/page/ref/something?param=foo` found in your plain text to `http://blah.com/page/ref/something?param=foo`, yes? — PaulMcG, Nov 13 '09 at 08:34
the answers so far have focussed on matching the URL. How about replacing it with the link? — hoju, Nov 13 '09 at 09:22

score 11 · Accepted Answer · edited May 23 '17 at 11:54

11

You can load the document up with a DOM/HTML parsing library ( see html5lib ), grab all text nodes, match them against a regular expression and replace the text nodes with a regex replacement of the URI with anchors around it using a PCRE such as:

/(https?:[;\/?\\@&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\@\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g

I'm quite sure you can scourge through and find some sort of utility that does this, I can't think of any off the top of my head though.

Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?

import re

urlfinder = re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+):[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]")

def urlify2(value):
    return urlfinder.sub(r'<a href="\1">\1</a>', value)

call urlify2 on a string and I think that's it if you aren't dealing with a DOM object.

edited May 23 '17 at 11:54

Community

1
1

answered Nov 13 '09 at 06:43

meder omuraliev

183,342
71
393
434

So, what is not allowed in a url? – Amarghosh Nov 13 '09 at 06:45
Btw, what if the link is already inside href attribute of an anchor tag? – Amarghosh Nov 13 '09 at 06:46
When you're inside the text node, make sure the parent or ancestor isn't an anchor. – meder omuraliev Nov 13 '09 at 06:49
@Amarghosh - The coder's regex dictates what a "url" is, of course I agree with you that it's going to be difficult making one that's compatible for all types of urls but even the most basic will support 95% of urls. – meder omuraliev Nov 13 '09 at 06:52
2

Should the regexp be: `"([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\$\$,;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"`, this one does not compile. – Herbert Feb 15 '17 at 14:36

crizCraig · Answer 2 · 2011-07-09T03:58:08.567

I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:

_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')

def linkify(text, maxlinklength):
    def replacewithlink(matchobj):
        url = matchobj.group(0)
        text = unicode(url)
        if text.startswith('http://'):
            text = text.replace('http://', '', 1)
        elif text.startswith('https://'):
            text = text.replace('https://', '', 1)

        if text.startswith('www.'):
            text = text.replace('www.', '', 1)

        if len(text) > maxlinklength:
            halflength = maxlinklength / 2
            text = text[0:halflength] + '...' + text[len(text) - halflength:]

        return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'

    if text != None and text != '':
        return _urlfinderregex.sub(replacewithlink, text)
    else:
        return ''

You'll need to get a link out image, but that's pretty easy. This is specifically for user submitted text like comments which I assume is usually what people are dealing with.

I looked around as well, including a few frameworks that implemented their own linkify function, and I found this one to be the most readable for non-complex purposes. — JayD3e, Sep 07 '12 at 21:28
This is by far the best regex based text-->HTML script I've found on StackOverflow. It's significantly better than even the accepted answer. I only had to change the line `text = unicode(URL)` to `text = str(URL)` for Python 3.x; and remove the link out image. — MKANET, Apr 20 '20 at 03:38

echo · Answer 3 · 2009-11-13T07:51:21.907

1

/\w+:\/\/[^\s]+/

edited Nov 13 '09 at 07:51

answered Nov 13 '09 at 07:45

echo

7,737
2
35
33

score 0 · Answer 4 · edited May 23 '17 at 12:17

0

When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.

Matching the actual URLs is probably best done with the urlparse module. Full discussion here: How do you validate a URL with a regular expression in Python?

edited May 23 '17 at 12:17

Community

1
1

answered Nov 13 '09 at 06:48

steveha

74,789
21
92
117

score 0 · Answer 5 · answered Nov 13 '09 at 06:51

Gmail is a lot more open, when it comes to URLs, but it is not always right either. e.g. it will make www.a.b into a hyperlink as well as http://a.b but it often fails because of wrapped text and uncommon (but valid) URL characters.

See appendix A. A. Collected BNF for URI for syntax, and use that to build a reasonable regular expression that will consider what surrounds the URL as well. You'd be well advised to consider a couple of scenarios where URLs might end up.

replace URLs in text with links to URLs

5 Answers5

Linked